GTM2016: 6TH GLOBAL TECHMINING CONFERENCE
PROGRAM FOR TUESDAY, SEPTEMBER 13TH


09:00-09:30 Session Opening: Welcome and Keynote Presentation

Welcome

Denise Chiavetta and Alan Porter

Keynote "Democratizing Text: Fostering New Capabilities in Text Mining"

Scott Cunningham


Location: Red Cube Building
09:40-10:40 Session 1A: TECHMINING METHODS


Location: Red Cube Building
09:40
Lexical analysis of scientific publications for nano-level scientometrics
SPEAKER: unknown

ABSTRACT. Components of text analysis are often applied in scientometrics in combination with link-based techniques. The objective is usually to study the structure of medium-sized or large document sets or to monitor the evolution of research fields or topics at the global and local level. Taking up the objectives of evaluative scientometrics, we link textual analysis of individual scientific papers (or smaller sets of those) to evaluative aspects of bibliometrics. The objective is quite similar to that of large-scale analysis: studying structures, detecting (dis-)similarities, monitoring evolution and detecting new trends. We proceed from approaches also used in quantitative linguistics, but now with a focus on bibliometrics. We studied 18 bibliometric papers published by András Schubert in the period 1983-2013 and created two data sets, based on all words and on nouns only, respectively. In contrast to the traditional approach (e.g., in quantitative linguistics), where rank frequencies are used, we used frequency distributions; in particular, we applied a Waring model. For the parameter estimation we applied a hybrid MLE method. Both data sets provided similar parameters, substantiating that the parameter responsible for the power-law property is close to the value 1.0, i.e., word frequency has no finite expectation. The exercise further showed that the distribution of word use remained stable over time but the core of most frequently used words changed over the period of about 20 years. The applied methodology can be used to detect characteristics of and changes in language use in scientific text and the emergence of new topics at the micro/nano level.
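
For readers who want to reproduce the data-preparation step, a minimal sketch of building such a word-frequency distribution (the input a Waring model would be fitted to) follows. The regex tokenizer and sample texts are illustrative assumptions, not the authors' pipeline; the noun-only variant would additionally need a POS tagger.

```python
import re
from collections import Counter

def frequency_spectrum(texts):
    """Map each observed word frequency to the number of distinct words
    that occur with exactly that frequency."""
    words = Counter()
    for text in texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return dict(sorted(Counter(words.values()).items()))

papers = ["citation rates of journals and papers",
          "journals, papers and citation impact"]
print(frequency_spectrum(papers))  # {1: 3, 2: 4}
```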

10:00
Early insights of Emerging Sources Citation Index (ESCI): a bibliometric analysis and overlap mapping method
SPEAKER: unknown

ABSTRACT. With the rapid development of advanced science and technology, new journals emerge in large numbers. On the one hand, these journals offer a more valuable data source for bibliometrics scholars to accelerate research, scientific discovery and intellectual property innovation; on the other hand, how to pick out the peer-reviewed publications of regional importance and in emerging research fields becomes an essential issue for funders, key opinion leaders, and evaluators. Under these circumstances, the Web of Science platform, as the world’s most trusted citation index covering the leading scholarly literature, launched the Emerging Sources Citation Index (ESCI) in November 2015 to extend the universe of publications included in the Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI) and Arts & Humanities Citation Index (A&HCI). In this paper, we carry out our study in the following three aspects: 1) conduct research profiling of the journals indexed in ESCI to explore the characteristics of their publishing activity; 2) evaluate the influence of the journals indexed in ESCI in accelerating regional communication and expanding global collaboration; 3) visualize the discipline distributions of the publications indexed in ESCI, especially in emerging fields. The results show that ESCI has a positive effect on expanding research assessment and expediting scientific collaboration. However, it has so far done little to improve the underrepresented position of non-English-speaking countries and regions. Furthermore, how to balance journal selection across different research domains and thereby facilitate cross-disciplinary research still needs further effort.

10:20
Mapping Sugar Sweetened Beverage Policy: TechMining Across SCOPUS and LexisNexis
SPEAKER: Lexi White

ABSTRACT. The study of citation patterns in scientific research has been a fruitful area of research in recent years. Many scientometric researchers have investigated networks of research publications and indicators in a variety of databases. Little research, however, has focused on citation patterns among legal publications. Almost no research has looked at citation patterns between legal and scientific publications. Since legal publications are housed in different databases than scientific publications, they are excluded from large citation studies like those done by Leydesdorff and colleagues (2015) that probe databases such as SCOPUS and Web of Science. While the two primary legal publication databases, LexisNexis and Westlaw, are owned by publishing giants Elsevier and Thomson Reuters respectively, they operate very differently and are not optimized to allow exploration of network patterns among the articles. This research seeks to explore citation patterns on a specific, bounded topic—sugar sweetened beverages—across not only scientific research, but also legal research.

09:40-10:40 Session 1B: TECHMINING INDICATORS
Location: Green Cube Building
09:40
Knowledge discontinuities, obsolescence and the rate of technology performance improvements
SPEAKER: unknown

ABSTRACT. Long-term economic growth relies on technology improvements. Differences in the quality of available technology across countries, sectors and firms are crucial determinants of productivity and output differentials. If we want to understand the fundamental sources of technical change we need to understand why some technologies improve much faster than others on a global scale. Moreover, the future direction of societal development depends on the relative rate of improvement of diverse technologies. Therefore, such understanding has important practical and policy implications as well.

In this research, we define a method to empirically identify discontinuities in engineering design trajectories and estimate the knowledge obsolescence rate in a technology domain using patent data. Discontinuities are understood as changes in the underlying approaches used to tackle engineering challenges. As such they can be measured by looking at changes in the main paths of citations in patent-networks. Knowledge obsolescence in a technology domain is measured as the speed of decay of the probability that the domain's patented inventions will receive citations as a function of their age.

We then test the hypothesis that faster rates of technology performance improvement in a technology domain are associated with a higher number (or more frequent arrival) of knowledge discontinuities in its engineering design trajectories and with faster rates of knowledge obsolescence. We use patent data and technology performance data for a set of 28 technology domains (such as integrated circuits, 3D printing, genome sequencing and solar photovoltaic) to test this hypothesis. We build a structural model that estimates the effect of the arrival rate of discontinuities on the rate of knowledge obsolescence and the rate of technology performance improvements.

10:00
Disciplinary Integration and the Role of Border Fields
SPEAKER: Jan Youtie

ABSTRACT. Encouraging interdisciplinary research has been a science policy goal, but barriers in terminology, tools and instruments, and analytic approaches exist. Our work focuses on the role of the border field in advancing flows of knowledge between two fields that are important in US efforts to improve science, technology, engineering and mathematics (STEM) education: education research and cognitive science. In the case of education research and cognitive science, it may be “a bridge too far” to think of dramatically increasing direct knowledge flows between these two fields. Border communities such as educational psychology can act as an intermediary or additional bridge between the two fields. We posit that there are three subfields that serve as border communities: educational psychology, human/computer interaction and learning technologies, and applied linguistics. These border communities are assumed to sit between cognitive science and education research, but at the same time apart as they are scholarly communities in their own right and with their own literatures. The extent to which educational psychology draws on cognitive science, draws on education research, and influences both communities is an open question. We examine this proposition by analyzing cited references in metadata from articles in the Web of Science published in five selected years in the 1994-2014 time period using journal and journal-category based definitions of the fields in question. Our results show there are relatively small direct citation rates between articles in education research and cognitive science and relatively larger rates by which each cites articles appearing in border field journals. Border fields would indeed appear to be situated at the border between education and cognitive science.

10:20
How the analysis of structural holes in academic discussions helps in understanding genesis of advanced technology
SPEAKER: unknown

ABSTRACT. Since the early 1960s there has been growing interest in the development and use of new technologies, accompanied by a strong wish of decision makers to govern the related processes at corporate and national levels. One of the key categories that appeared to set up analytical and regulatory frameworks was the category of advanced technology. Primarily associated with computer electronics and microelectronics, it soon acquired new meanings derived from a variety of professional discussions, first in the social and later in the natural and engineering sciences. This paper focuses on the evolution of academic discussions on advanced technology and the perception of the term as represented in professional discourses for the period from 1960 until 2015. In order to identify the major shifts in those discussions, and therefore changes in understanding and classifying advanced technology, we identified and examined ‘structural holes’ that appear in sets of co-citation networks taken at smaller intervals within the considered period. It is shown that papers appearing in two sequential intervals demonstrate higher levels of betweenness centrality. This allows us to consider them as important elements of the networks that fill in structural holes in the current period and may structure future communications.
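
A minimal sketch of the centrality computation described here, assuming a toy weighted co-citation network and using networkx; it is illustrative, not the authors' code.

```python
import networkx as nx

# Weighted co-citation pairs: (paper, paper, co-citation count); invented data.
cocitations = [("paper_A", "paper_B", 4), ("paper_B", "paper_C", 2),
               ("paper_A", "paper_C", 1), ("paper_C", "paper_D", 3)]

G = nx.Graph()
G.add_weighted_edges_from(cocitations)

# High betweenness = the node sits on many shortest paths between
# otherwise weakly connected parts of the network (a structural-hole filler).
bc = nx.betweenness_centrality(G, weight="weight")
for paper, score in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```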

10:40-11:00 Coffee Break, PURPLE CUBE BUILDING
11:05-12:05 Session 2A: TECHNICAL EMERGENCE
Location: Red Cube Building
11:05
Finding factors behind potential breakthrough papers
SPEAKER: unknown

ABSTRACT. Foreseeing the occurrence of scientific discoveries that have an above-average impact on future research is a ‘holy grail’. Several scholars have proposed theoretical models that describe the evolution of science. In recent years we constructed and implemented a set of automatic computer algorithms for the early detection of scholarly ‘breakout’ papers by harvesting databases with bibliographic information. The algorithms originally developed for the early identification of breakout papers were adapted so that they can be used to analyse the breakout character of individual publications at any point in time. This allows us to analyse the ‘environmental’ factors that influence a paper to become a breakout paper.

The current focus is on the collaboration of authors. Factors we currently focus on are (1) the size of a research group, (2) the organisational collaboration of the researchers, and (3) cross-border collaboration. Special attention is given to the question ‘Does the influence of these factors depend on the age of a paper?’ Preliminary research shows that the majority of papers get their breakout ‘status’ within two years after publication. For all types of organisational collaboration, papers have an above-average chance of being a breakout paper; papers from companies and hospitals alone have a below-average chance of being a breakout paper.

11:25
A Measure of Staying Power: Does the Persistence of Emergent Concepts Significantly Vary by Technology Space?

ABSTRACT. This study advances an indicator for technical emergence, an indicator grounded in the four components set forth by the Foresight and Understanding from Scientific Exposition (FUSE) Program: (i) novelty, (ii) persistence, (iii) community and (iv) growth. The emergence tool used in this analysis has been repeatedly tested on multiple datasets and has demonstrated the ability to significantly predict the future of term usage—i.e. concepts it identifies as emergent significantly outperform their non-emergent peers. Not all terms that surface as emergent enjoy their time in the spotlight indefinitely, however (i.e. some terms and some scientific domains benefit from emergence status longer than others do). Given that some emergent technology spaces persist for a longer time - or have greater staying power - than others, a key objective of the present undertaking is to measure persistence by domain. The proposed dataset for this undertaking is a dye-sensitized solar cell (DSSC) dataset, which was built using a modularized Boolean approach for identifying this research on the Web of Science (WOS). Given their interdisciplinary nature, DSSCs can be deconstructed into 19 Meta Disciplines using a thesaurus devised by Alan Porter (Georgia Tech) that categorizes all scientific publications indexed on WOS. Identifying those domains that demonstrate greater persistence, or staying power, has significant implications for the future direction of R&D funding as well as other forms of attention. Emergent domains with greater staying power are expected to have longer and potentially broader impact. If the persistence of emergent concepts can be shown to significantly vary by domain within the DSSC framework, the application of this procedure to other technology spaces is encouraged as well.

11:45
Networks dynamics in the case of emerging technologies

ABSTRACT. This research aims at increasing our understanding of how collaborative networks form, evolve and are configured in the case of emerging technologies. Emerging technologies are technologies with the potential to exert a considerable socio-economic impact in the domain in which they emerge. They are radically novel, have already moved beyond the conceptual stage, and show relatively fast growth in terms of actors involved in knowledge production processes and outcomes of these processes (e.g. publications, patents, products/services) (Rotolo, Hicks, & Martin, 2015). Previous studies have extensively investigated the consequences of network variables on actors’ behaviour and performance, stressing the importance of networks to gain social, institutional and governance benefits and private advantages (Burt, 1992; Granovetter, 1983). Yet, the genesis and dynamics of networks are a largely unexplored area of research (Ahuja, Soda, & Zaheer, 2012; Gulati & Gargiulo, 1999). This is especially important in the context of emerging technologies. Given that these are in a state of flux, the architecture of networks (and the distribution of the benefits and advantages among the actors involved) is likely to change over the emergence process. This paper aims to fill this gap by conducting an in-depth case-study analysis.

11:05-12:05 Session 2B: TECHMINING FOR INNOVATION STRATEGY
Location: Green Cube Building
11:05
Rejecting Moderation: An Entropy-based Indicator System for Measuring Patent Technological Innovation Potential
SPEAKER: Yi Zhang

ABSTRACT. How to evaluate patent value quantitatively and systematically is an intriguing scholarly topic in bibliometrics. This paper attempts to construct an entropy-based indicator system to measure the technological innovative capability of patents. One basic target is to identify significant patents with high technological innovation rather than those that are merely moderate across multiple dimensions. This paper first proposes a patent indicator system that contains three macro-level perspectives: a technological perspective, a legal perspective, and a market perspective. Each perspective is constituted by a number of patent indicators; we calculate the correlation of these indicators to make sure they are suitably independent variables. Then, based on a small training set, we apply a learning-based collaborative filtering technique to remove noise and reduce the scale of the target patent corpus. Shannon’s entropy (Shannon 1948), well known as a coefficient for measuring complexity and uncertainty, is introduced to weigh indicators quantitatively. Its basic weighting criterion is that the more common an indicator is, the less weight it has. In other words, patents with irregular indicator values are ranked higher. We identify the entropy-weighted value as indicating technological innovative capability. The output of our method is a set of entropy-weighted patents. Aided by expert knowledge, it can be used to seek patents with technological value and innovative potential. Furthermore, we consider how the entropy measures could serve to forecast possible technological recombination in the near future. We apply our method to all 26,982 patents with Chinese assignees in the United States Patent and Trademark Office (USPTO) database, covering the period from 1976 to 2014. The results demonstrate the feasibility and efficiency of our method, and also provide interesting insights for related Research & Development (R&D) planning and strategic management.
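
The weighting criterion described above corresponds to the classic entropy weight method; a minimal sketch follows, with an invented indicator matrix. The authors' actual indicators and preprocessing are not reproduced here.

```python
import numpy as np

# Rows are patents, columns are indicators (values invented for illustration).
X = np.array([[12.0, 3.0, 0.9],
              [ 4.0, 1.0, 0.2],
              [ 7.0, 5.0, 0.4]])

P = X / X.sum(axis=0)                          # column-wise proportions
k = 1.0 / np.log(X.shape[0])                   # normalizes entropy to [0, 1]
logP = np.log(np.where(P > 0, P, 1.0))         # log(1) = 0, so 0 * log 0 -> 0
entropy = -k * (P * logP).sum(axis=0)
weights = (1 - entropy) / (1 - entropy).sum()  # "common" indicator -> low weight

scores = X @ weights                           # entropy-weighted patent scores
print(weights.round(3), scores.round(2))
```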

11:25
Integration of tech mining into technology roadmapping: advanced tools for creating, validating and updating technology roadmaps

ABSTRACT. This study explores the possibilities of using tech mining tools for creating, validating, and updating technology roadmaps, taking the aviation area as an example. To achieve this goal it identifies the main trends, markets, technologies, and technological products in the aviation area using qualitative (expert procedures) and quantitative (bibliometrics) techniques; verifies these results with the help of tech mining software (Vantage Point, VOSviewer); and analyzes the potential for using tech mining methods at different stages of technology roadmapping. The results of this research can be interesting for policy makers financing roadmapping activities in order to set priorities in science and technology; for practitioners scanning disruptive innovations in the most important markets to support their corporate strategies; and also for the scientific community, contributing to further integration of qualitative and quantitative foresight methods.

The methodology of this study consists of three stages. First, the initial list of technology trends for the roadmap is created using quantitative (bibliometric analysis, etc.) and qualitative (surveys, interviews, consultations with aviation specialists) methods. Second, a quantitative analysis of Web of Science (WoS) publications in the aviation area over the last 10 years (2006-2015) is conducted. The collection of publications (created by using keywords provided by the experts and retrieved from WoS) is processed (cleaned and grouped), analyzed (based on keyword co-occurrence), and visualized with the help of Vantage Point and VOSviewer. Finally, the results of the quantitative analysis are used for validating the expert list of technology trends, markets, technologies, and technological products. The analysis is conducted through 3 rounds of discussions with experts.

As a result of the quantitative procedures, information about possible technology trends, markets, technologies, and technological products in the aviation area for the period 2006-2015 was retrieved from the collection of WoS publications in order to validate the roadmap. The possibilities of employing tech mining tools at different stages of roadmapping were explored and analysed. It was concluded that the following factors should be taken into account when using them: the time horizon of the study (e.g., strategic documents and international reports can be more useful for understanding long-term technology trends); the stage of roadmap development (pre-roadmapping, desk research, expert procedures, creative analysis, interactive discussion); the type of information needed (e.g., emerging technologies, research fronts, disruptive technologies); the sort of information sources (publications, patents, web content); and others. Further research will be devoted to a detailed analysis of these factors, as well as to the development of a more systematic methodology for integrating tech mining tools into technology roadmapping.

11:45
The evolution of the disciplinary structure of Nanoscience & Nanotechnology
SPEAKER: Chunjuan Luan

ABSTRACT. Nanoscience and nanotechnology (N&N), as a typical emerging, promising, critical and converging technology field, has attracted tremendous governmental funds and scientific efforts. With the explosive growth of N&N articles, studies on N&N have been widely conducted by information scientists worldwide, yet investigations concerning the disciplinary structure of N&N seem to be deficient. The innovation of this paper lies in clearly mapping the disciplinary network structure of subjects related to N&N by employing the social network analysis tool Netdraw; identifying the Web of Science Categories (WCs) playing a mediating role, using the indicator of nBetweenness centrality, in different developmental stages; and, especially, analysing clusters by applying cliques analysis among disciplines related to N&N, revealing how close or distant the distinct areas pertinent to N&N are. The results of this paper can help us better understand the original knowledge sources of N&N at its beginning stage, and the dynamic evolution of N&N over time.

12:10-13:40 Lunch, PURPLE CUBE BUILDING
13:00-13:40 Session EM: TECHMINING DATA and TOOL REVIEW

We invite you to join us for a series of short presentations by providers of TechMining tools and resources. Feel free to bring your lunch into the auditorium. Presenters will also participate in the evening Poster session.

Location: Red Cube Building
13:00
InnoGPS: A tool for innovation navigation and strategy
SPEAKER: unknown

ABSTRACT. The space of technologies is vast, with many domains of technology making complex connections to each other. An individual inventor may be expert in a few of these domains, but still know little about how to approach other domains. We previously created a map of 629 technology domains and their relationships, using citation data from 4 million patents to measure whether each pair of domains interacted more than would be expected by chance. We later showed that inventors are likely to follow this map: an inventor who has previously patented in a domain, like “Shaping Plastics,” is over 240x more likely to successfully patent in a very related domain, like “Shaping Clay,” than in an unrelated domain like “Pleating Textiles.” The map also predicted performance: inventors who entered domains related to their previous work patented more. We have now constructed a tool, InnoGPS, to enable individuals, organizations and governments to use this map to their advantage. InnoGPS enables users to visualize the space of technologies and locate their knowledge on the map. InnoGPS recommends nearby domains that the users may be able to enter more easily. Users can view further information about these recommended domains, including inspirational patents, top competitors, and recent news items. InnoGPS can also recommend multi-phase growth strategies, in which the user enters multiple domains in sequence in order to reach a distant target. InnoGPS can help individuals, organizations and governments to understand their existing knowledge portfolio and to create a strategy for their future growth.
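
A hedged sketch of the kind of "more than expected by chance" relatedness statistic the abstract mentions; all counts below are invented, and InnoGPS's actual measure may differ.

```python
# Lift-style relatedness between two technology domains, A and B.
n_patents = 4_000_000                 # patents in the citation dataset
n_a, n_b = 50_000, 30_000             # patents in domain A / domain B
n_ab = 6_000                          # patents linking both domains

expected = n_a * n_b / n_patents      # co-occurrences expected by chance
lift = n_ab / expected
print(f"lift = {lift:.1f}")           # lift >> 1 -> strongly related domains
```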

13:10
A novel data source of global research funding to provide context for strategic planning and evaluation efforts
SPEAKER: unknown

ABSTRACT. An important dimension of portfolio planning and program evaluation for research funders is judging the contribution of a program within the context of other funding activities. Many programs specifically aim to address understudied areas of science. However, objectively assessing whether an area of science is indeed understudied remains a persistent challenge for administrators. Here we present a new tool, Dimensions for Funders, built on the aggregation and design of a novel data source of global research funding. It leverages a series of text mining approaches to allow clearer definition of the topics of interest to research administrators and evaluators, so that they can more easily assess the specific contribution of a given program within the context of the global funding landscape. Dimensions allows for the real-time display of research investments based on roughly $1 trillion of research grants from over 200 funding bodies globally. The software facilitates analysis by existing categorization schemes such as the US National Institutes of Health Research, Condition, and Disease Categorization (RCDC) and the Australian Bureau of Statistics Fields of Research (FOR), as well as the creation and sharing of custom categorization schemes to ensure transparency and reproducibility of results in reporting. We present examples of studies that use Dimensions to examine global contributions to research on diseases which are currently garnering significant attention globally.

13:20
VantagePoint and IISC PatStat
SPEAKER: Nils Newman

ABSTRACT. EPO Worldwide Patent Statistical Database (also known as EPO PATSTAT) has been specifically developed for use by government/intergovernmental organisations and academic institutions. IISC PatStat is a version formatted for automatic import and analysis via the VantagePoint Academic and VP Student Edition text-mining tools.

13:30
A worldwide linked trademark database for IP research

ABSTRACT. Researchers and policy makers are concerned with many international issues regarding trademarks, such as trademark squatting, cluttering, and dilution. Trademark application data can provide an evidence base to inform government policy regarding these issues, and can also produce quantitative insights into economic trends and brand dynamics. Currently, national trademark databases can provide insight into economic and brand dynamics at the national level, but gaining such insight at an international level is more difficult due to a lack of internationally linked trademark data. We are in the process of building a harmonised international trademark database (the “Patstat of trademarks”), in which equivalent trademarks have been identified across national offices.

We have developed a pilot database that incorporates 6.4 million U.S., 1.3 million Australian, and 0.5 million New Zealand trademark applications, spanning over 100 years. The database will be extended to incorporate trademark data from other participating intellectual property (IP) offices as they join the project. Confirmed partners include the United Kingdom, Canada, WIPO, and OHIM. We will continue to expand the scope of the project, and intend to include many more IP offices from around the world.

In addition to building the pilot database, we have developed a linking algorithm that identifies equivalent trademarks (TMs) across the three jurisdictions. The algorithm can currently be applied to all applications that contain TM text, i.e. around 96% of all applications. In its current state, the algorithm successfully identifies ~ 97% of equivalent TMs that are known to be linked a priori (due to a shared international registration number).

Current estimates indicate that approximately 40% of candidate positive links identified by the algorithm are false positives. However, we expect the proportion of false positives to become far smaller as we continue to improve the linking algorithm.

A major part of improving the linking algorithm will involve combining it with a separate machine learning algorithm that we have recently developed, which exhibits very low false positive and false negative error rates. Briefly, the machine learning algorithm includes an image classification neural network that we adapted to match and disambiguate inventor names in patent records. It uses a novel matching technique whereby each pair of inventor records is compared by firstly converting the two records from raw text into an abstract visual representation, or “comparison image”. The neural network is able to learn important features within comparison images that indicate whether the two inventor records are likely to be a match (both inventor records refer to the same inventor) or non-match (records refer to different inventors). This is done by training the neural network on data that has been manually labelled as match/non-match. Tests on a sub-sample of the labelled data (withheld from the network during training) indicate error rates as low as ~ 1%. We are currently modifying this machine learning algorithm to match trademarks, rather than inventor names.
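
A purely speculative illustration of what a text-pair "comparison image" could look like: a character-match matrix that a CNN could classify as match or non-match. The paper's actual encoding is not described here, so treat every detail below as an assumption.

```python
import numpy as np

def comparison_image(a: str, b: str, size: int = 32) -> np.ndarray:
    """Render two name strings as a 2D binary image: cell (i, j) is 1
    where character i of `a` equals character j of `b`."""
    a = a.lower()[:size].ljust(size)
    b = b.lower()[:size].ljust(size)
    img = np.zeros((size, size), dtype=np.float32)
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb and ca != " ":      # ignore padding
                img[i, j] = 1.0
    return img

img = comparison_image("SMITH, John A.", "Smith, J.A.")
print(img.shape, int(img.sum()))            # input a CNN could classify
```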

When complete, the internationally linked trademark database will be a valuable resource for researchers and policy-makers in fields such as econometrics, intellectual property rights, and brand policy.

13:45-15:00 Session 3A: METHODS IN RESEARCHER DISAMBIGUATION & CLASSIFICATION
Location: Red Cube Building
13:45
Research on Author Name Disambiguation Based on Semantic Fingerprint
SPEAKER: Hongqi Han

ABSTRACT. Author name ambiguity is a kind of uncertainty phenomenon: authors of scholarly documents often share names, which makes it hard to distinguish each author's work. Due to homonyms, readers and analysts are often confused when they search the literature. Author ambiguity problems have been an obstacle to efficient information retrieval in the digital library age, causing incorrect identification between authors and their publications. Author name disambiguation is a fundamental step in mapping knowledge domains and in other bibliometric and scientometric analyses. It has great practical application value and a strong influence on marketers who wish to direct their advertisements to specific individuals. It is also crucial for establishing new resources such as co-author networks, citation networks and collaboration networks, and has been widely applied in personalized search, automatic question answering, multi-document summarization, tracking and discovery of prominent figures, and other fields. Our goal is to find all publications that belong to a given author and distinguish them from publications of other authors who share the same name, and also to sort out entities assigned erroneously due to name ambiguity, in a fast and efficient way. Existing methods are usually unable to meet the demands of practical application, especially under the condition of rapid growth of the scientific literature. In this work, firstly, according to the characteristics of scholarly documents, we extract the email addresses and the affiliations of authors from pre-processed documents. Then a fingerprint generator is used to generate email fingerprints, affiliation fingerprints and text fingerprints, where the text fingerprint is generated by a semantic fingerprint algorithm (such as simhash) using the full text of a document. After that, a fingerprint comparator is used to compare the fingerprint of an unknown article with the fingerprints of same-name articles in the database, where same-name authors in the database have already been disambiguated. A claim decision maker then judges which author in the database the unknown article belongs to, or whether it belongs to a new same-name author, according to the result of the comparator. Finally, an arbiter judges the attribution when more than one same-name author claims the unknown article. The name disambiguation method based on semantic fingerprints proposed in this paper never involves comparison of the original texts; full-text similarity is converted into the comparison of fixed-length fingerprints. The method can dynamically build a fingerprint database and supports incremental disambiguation, instead of clustering all papers with same-name authors as in traditional methods.
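
A minimal sketch of simhash, the semantic fingerprint algorithm named above, together with the Hamming-distance comparison step; the tokenizer, hash choice and sample strings are illustrative assumptions, not the authors' implementation.

```python
import hashlib
import re
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """Fold weighted word hashes into one fixed-length fingerprint."""
    v = [0] * bits
    for word, weight in Counter(re.findall(r"\w+", text.lower())).items():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

f1 = simhash("nanostructured solar cells and electrode materials")
f2 = simhash("electrode materials for nanostructured solar cells")
print(hamming(f1, f2))   # small distance -> likely the same author profile
```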

14:05
Gender profiles in patenting: analysing female inventorship
SPEAKER: unknown

ABSTRACT. For many years the UK Government has been inspiring girls and women to study and build careers in STEM fields – science, technology, engineering and mathematics. Educational diversity statistics are comprehensive, with the number of women attaining STEM qualifications in the UK increasing from 8% in 2011 to 24% in 2013. In industry, however, the statistics primarily rely on ‘inputs’ such as the number of women employed (in the UK, women account for 13% of the STEM workforce and only 5.5% of engineering professionals). Very little data is available on the ‘outputs’ of work undertaken by women within STEM industries.

Whilst absolute patent counts do not give a direct measure of innovation, they can be used to provide a measurable ‘output’ of STEM industries and it is highly desirable to analyse the inventor demographic in order to understand how inventor gender influences the patent system. Until now this data has been unobtainable but recent gender inference work has changed this.

In 2016 the Intellectual Property Office (UK IPO) undertook a preliminary study, taking baseline name-gender datasets and fusing them with GB patent data. The study shows that there has been a 16% increase in the proportion of female inventors on GB patent applications in the last 10 years. It goes on to compare the proportion of British female inventors against comparator countries and a technology breakdown of female inventors reveals a number of traditional associations. Further investigations include the proportion of female inventors working alone or as part of a team.

Following the successful trial using GB patent data, the UK IPO has built on previous gender work and expanded it to include all published patents worldwide using the EPO Worldwide Patent Statistics database, PATSTAT. It is now possible, with a high degree of confidence, to infer gender from inventor name data and provide statistical analysis about the patenting activity of female inventors. This research required the use of multiple datasets from disparate sources and used data cleaning and data manipulation techniques to provide a linked dataset that could be mined to provide useful intelligence on inventor gender. The results provide quantitative data to back up anecdotal evidence about female inventors within the IP industry, providing a sound basis for future evidence-based policy within government and industry.
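
A toy sketch of the name-to-gender fusion step described above; the tiny dictionary is a placeholder for the baseline name-gender datasets, and a real pipeline would also handle initials, diacritics and country-specific name orders.

```python
# Placeholder name-gender lookup; a real baseline dataset has many thousands
# of entries with per-country frequencies.
name_gender = {"sarah": "F", "james": "M", "maria": "F", "david": "M"}

inventors = ["Sarah Jones", "David Li", "M. Petrova"]

def infer(full_name: str) -> str:
    first = full_name.split()[0].lower().rstrip(".")
    return name_gender.get(first, "unknown")   # bare initials stay unknown

for name in inventors:
    print(name, "->", infer(name))
```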

The work further explores the nuances in the data and addresses more qualitative issues; for example, there are significant differences in the format and quality of inventor names from different countries across the various patent jurisdictions, and so certain groups require special consideration, which is discussed in more detail.

The study shows that patent data can be a good source of evidence to use in the wider gender debate. However, it should be used in conjunction with other data sources to form a bigger picture, which the paper also begins to explore.

14:25
Identifying Author Heritage Using Surname Data
SPEAKER: unknown

ABSTRACT. This research paper proposes a novel method to identify the ethnic or national heritage of authors based on the morphology of their surnames. Authors' first names have been found to be a reliable indicator of gender (Meng and Shapira, 2010), while dealing with Chinese surname data is a widely recognised problem in bibliometric research that has attracted multiple solutions (Tang and Walsh, 2010). However, similar studies for Slavic surnames are virtually non-existent. We develop this line of work by using the morphology of surnames as a key component of information retrieval in large imbalanced datasets (Chawla, 2005). We argue that surname morphology can serve as a reliable approximation of the ethnic or national heritage of researchers and demonstrate this by developing a 2-step search procedure for post-Soviet surname data retrieval in a nanotechnology publication dataset. The source of data for surname-based information retrieval is the particular structure of Russian surnames, namely their patronymic suffix (Unbegaun, 1972). In the case of Russian surnames, the simplest rule, which uses only the two most popular patronymic suffixes, returns about 80% relevant results. Scenarios that add combinations of other endings and full exception surnames increase the recall of the method to 0.95. As the second step, a combination of Boolean string search and full exception names excludes false positive records and increases the precision of the search up to 0.98. The consistency of Russian heritage surname retrieval is maintained overall, with some national fluctuation. The findings of this research suggest that surname data can be used to identify communities of scientists or inventors based on a shared country of origin (national, or ethnic in mononational countries). The method developed and elaborated in this paper is a robust tool that can be used to solve a variety of tasks. For example, it contributes to improving solutions to the classic name disambiguation problem and can be used with little variation for identifying other Eastern European diasporas abroad, such as Czech or Bulgarian authors.
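
A hedged sketch of the two-step retrieval procedure the abstract outlines: a patronymic-suffix rule followed by an exception list. The suffixes and exception names below are illustrative, not the authors' actual rule set.

```python
import re

# Step 1: common Russian patronymic endings (illustrative subset).
SUFFIX_RULE = re.compile(r"(ov|ova|ev|eva|in|ina|sky|skaya)$", re.I)
# Step 2: non-Slavic surnames the suffix rule would wrongly catch.
FALSE_POSITIVES = {"martin", "baldwin"}

def looks_russian(surname: str) -> bool:
    s = surname.lower()
    return bool(SUFFIX_RULE.search(s)) and s not in FALSE_POSITIVES

for name in ["Ivanov", "Kuznetsova", "Martin", "Petrovsky"]:
    print(name, looks_russian(name))
```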

14:45
Leveraging ResearchGate for the purpose of Author Name Disambiguation
SPEAKER: unknown

ABSTRACT. The author name disambiguation task plays a very important role in individual-based bibliometric analysis and has suffered from a lack of information. Therefore, some have tried to leverage external Web sources to get extra evidence, with successful results. However, the main problem is generally the high cost of extracting data from web pages due to the diverse design of their contents. Considering this challenge, we mainly employ ResearchGate (RG), a social network platform for scholars that includes their publication lists and provides the data in a structured way. Even though the platform might be imperfect, it can be quite valuable when used along with some traditional approaches for the purpose of confirmation. To this end, we first apply a graph-based machine learning approach, building connected components to constitute clusters (CC). Among those clusters, a subset is drawn by retaining those having at least 10 members, in order to examine them in detail. We additionally employ the Google CSE API to access authors’ web pages as a complementary tool to RG. We observe almost the same F score (0.95) when only CC is applied and when CC confirmed by RG&CSE is applied. In addition, we observe that the publications found and confirmed through the external sources are cited relatively more than those publications not found in the related external sources. As a result, our suggested methodology has the potential to decrease the manual work required for individual-based bibliometric analysis. Besides, it might present more reliable results by confirming cluster members derived by unsupervised grouping methods.
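
A minimal sketch of the connected-components (CC) step, assuming invented records and evidence links; the paper retains clusters with at least 10 members, while the toy threshold here is lower.

```python
import networkx as nx

records = ["p1", "p2", "p3", "p4", "p5"]                 # publication records
strong_links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]  # shared evidence

G = nx.Graph()
G.add_nodes_from(records)
G.add_edges_from(strong_links)

# Each connected component becomes one candidate author cluster.
clusters = [sorted(c) for c in nx.connected_components(G)]
retained = [c for c in clusters if len(c) >= 2]          # paper uses >= 10
print(clusters, retained)
```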

13:45-15:00 Session 3B: TECHMINING APPLICATIONS: Biomedical
Location: Green Cube Building
13:45
Pharma Competitive Intelligence using open source data and visualization tool: towards metrics for drug R&D milestones

ABSTRACT. The pharma-biotech industry is facing a changing world with many market challenges. The industry is switching its business model, adopting a new R&D management strategy based on the open innovation model. In that context, technology scouting is evolving toward pharma competitive intelligence based on scientific and business information to support the decision-making process. A new approach could be built at the crossroads of technology forecasting (TF) and knowledge management (KM). TF would not rely only on patents as a major source of information but would also exploit clinical trial and market authorization data transparency, as well as further specific sources of information. KM would take advantage of the information flow during the drug R&D cycle. Open source databases with structured information were privileged, and we were able to identify specific sources of information for each milestone in drug R&D. The goal of this research was to implement a versatile data visualization tool; we chose Tableau Desktop to visualize trends across markets, bringing real-time clinical data across geographies and demographics to pipeline decision makers. As an example, we will present a landscape of angiogenesis inhibitor drugs and discuss metrics for drug R&D milestones. The analytical process transforms disaggregated market and competitor data into relevant strategic knowledge that can be readily put to use, relying on open source databases and a data visualization tool. It could help biotech startups as well as academic laboratories to manage the flow of competitive intelligence information, learn about new technologies and trends in the industry, and benefit R&D by improving discovery platforms or by reducing R&D costs.

14:05
What’s behind the curtain? – Dissecting the dynamics of evolution of emerging stem cell-based therapies
SPEAKER: unknown

ABSTRACT. Induced pluripotent stem cells (iPS cells) are expected to revolutionize our understanding, diagnosis, and treatment of diseases. Innovation research has focused on the application of these iPS cells in clinically-approved therapies, focusing on prescriptive solutions to overcome the ‘translational gaps’. Understanding of the processes underlying the emergence and progress of cell therapies is still unclear. This study explores the dynamic and largely intertwined processes involved in iPS cell-based therapies by focusing on three pathways of evolution, as suggested in the literature: (i) biomedical scientific understanding, (ii) development of medical technologies, and (iii) learning in clinical practice. The eye disease, age-related macular degeneration (AMD), the first iPS cell clinical transplant, is used as a case study. Multiple tech-mining approaches involving term maps, co-citation, and combined term-bibliographic coupling networks on scientific publications, further supported by interviews with key scientists, are proposed to reveal the dynamics within and between these aspects. Our results will provide an insight into the complexities involved in the development of stem cell-based therapies. Implications for the translation of knowledge into revolutionary stem cell-based clinical methods are discussed.

14:25
Applying Network Analysis Method to Aid Public Policy on Health: A Case Study in Brazil
SPEAKER: unknown

ABSTRACT. Databases on scientific publications are a well-known source for complex network analysis. The present work focuses on identifying synergy amongst researchers on Leishmaniasis, a Neglected Disease associated with poverty and very common in Brazil, India and many other countries in Latin America and Africa. Using the Web of Science and PubMed databases we have identified specific clusters related to collaboration between countries and their researchers. Based on those collaboration patterns and their evolution over the past 10 years, we aim to find tendencies for research, especially related to treatments.

Because they are related to poverty, neglected diseases such as Leishmaniasis have traditionally ranked low on national and international health agendas. They present little incentive for industry to invest in R&D, thus falling outside the pharmaceutical market. However, recent developments have drastically changed perspectives. Some of the countries affected by these diseases, such as Brazil and India, are now major emerging economies. In the last decade, governments and foundations have provided substantial funds intended for research programs. However, up to this point, we do not know how science has progressed and what the consequences of those changes are.

14:45
Discovering competitive strategies, accelerated with TechMining
SPEAKER: unknown

ABSTRACT. For centering strategy through the analysis of patents, one of the key fields to study is the claims field, where companies define the core invention they intend to defend against potential competitors. While counting the number of claims, be they independent or dependent, may not be very reliable as a measure of technology relevance, analysis by natural language processing can extract and correlate the key elements of a technology, disclosed in independent claims and complemented in dependent ones. We have analyzed the content of the claims of at least two competitors in the animal production sector, and have plotted the evolution of different components of a technology to see which are emerging, which are declining, and whether new elements enter the technology, by means of the following steps (see the sketch below):

1. Extract the terms of the claims by means of natural language processing.

2. Remove common noisy terms via fuzzy matching, manual touch-up using groups, further thesaurus grouping, and some additional clustering scripts. The newest (still nascent) terms can also be critical and should not be forgotten.

3. Generate a factor map to extract the main terms (a term clumping strategy could also be used).

4. Plot such terms in a Gantt chart / bubble map to see the emergence and trend of key terms and classifications.

5. Compare the maps of different competitors.

This key judgement could also be complemented with a functional semantic TRIZ map to understand the role of each term.
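
A minimal sketch of steps 1 and 4, assuming invented claim texts and a crude regex in place of real noun-phrase extraction; the cleaning steps 2-3 are omitted.

```python
import re
from collections import Counter

# Invented claim snippets keyed by filing year.
claims_by_year = {
    2013: ["a feed composition comprising probiotic strains ..."],
    2015: ["a probiotic additive for enhancing growth ..."],
}

# Step 1 (crude): extract candidate terms per year.
trend = {year: Counter(re.findall(r"[a-z]{4,}", " ".join(texts).lower()))
         for year, texts in claims_by_year.items()}

# Step 4 (tabular stand-in for the Gantt / bubble map).
for year, counts in sorted(trend.items()):
    print(year, counts.most_common(3))
```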

15:00-15:20 Coffee Break, PURPLE CUBE BUILDING
15:25-16:25 Session 4A: TECHMINING CLUSTERING METHODS
Location: Red Cube Building
15:25
From patents to technologies – A new level of analysis?
SPEAKER: unknown

ABSTRACT. In innovation research, patents are used as an indicator of the output of application-oriented research and development of a country that allows assessments of its current and future technological competitiveness. However, technologies generally are not protected by a single patent but by a whole series of related patents, an effect that is employed by patent applicants in order to boost the protection against competitors. The high number of patents representing an individual technology complicates the analysis of technologies. The objective of this feasibility study is to create an algorithm that aggregates similar patents representing a specific technology. This offers the possibility to perform analyses on the level of technologies instead of patents. It also allows interesting insights into the average number of patents protecting a technology and makes it possible to identify core patents within a patent cluster. In order to determine the similarity between two patents, we propose a multi-step procedure that first applies some essential basic conditions which have to be met by both candidate patents. In case two patents fulfill these conditions, more fine-grained criteria are employed in order to assess their similarity. For these criteria, which are based on information included in the patent specification or emerging during the registration process, individual similarity measures and weights are established and combined into a normalized similarity measure that indicates how related two patents are. Based on this similarity computation, a clustering algorithm is applied in order to identify different technologies.
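
A hedged sketch of the two-stage similarity test described above; the basic condition, the fine-grained criteria and the weights are all invented for illustration.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(p1, p2):
    # Stage 1: essential basic condition both candidates must meet.
    if p1["applicant"] != p2["applicant"]:
        return 0.0
    # Stage 2: fine-grained criteria, each scored in [0, 1].
    criteria = {
        "ipc":   jaccard(p1["ipc"], p2["ipc"]),
        "title": jaccard(p1["title"], p2["title"]),
        "inv":   1.0 if set(p1["inventors"]) & set(p2["inventors"]) else 0.0,
    }
    weights = {"ipc": 0.4, "title": 0.3, "inv": 0.3}   # invented weights
    return sum(weights[c] * v for c, v in criteria.items())

a = {"applicant": "ACME", "ipc": ["H01L"],
     "title": "solar cell module".split(), "inventors": ["Lee"]}
b = {"applicant": "ACME", "ipc": ["H01L", "H02S"],
     "title": "solar module".split(), "inventors": ["Lee", "Kim"]}
print(similarity(a, b))   # 0.7 -> thresholding and clustering would follow
```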

15:45
A very-short-text clustering method based on distributed representation for identifying research capabilities of a Higher Education Institution
SPEAKER: unknown

ABSTRACT. Text documents are an important source of data for tech mining techniques. Usually text databases include documents sufficiently long to apply conventional text mining techniques. However, in some tech mining tasks, such as the capabilities identification process, we have databases with very short texts, which represent a challenge for conventional text mining techniques. The problem has to do with the small number of terms, which fail to provide enough statistical information to find any kind of relationship among the documents in the collection. The main purpose of this work is to show how to generate thematic clusters using only the titles of the research projects of one Higher Education Institution.

Working with short-text collections has become an important area of research in information retrieval and data mining, due to the proliferation of data sources with reduced textual information, e.g. blogs, reviews, tweets and other social-network and message-sharing platforms. Many researchers have focused their efforts on techniques and applications of short-text clustering. In most of these works, short texts correspond to documents with a handful of sentences. However, there are no works that concentrate on the classification of very short texts, i.e., texts that do not span more than one sentence.
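
A minimal sketch of clustering one-sentence titles via distributed representations: average word vectors per title, then k-means. The random vectors below stand in for trained embeddings (e.g. word2vec); the authors' actual representation may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-ins for trained word embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "solar energy storage cancer cell therapy robot control".split()}

titles = ["solar energy storage", "cancer cell therapy", "robot control"]

def embed(title):
    """Average the word vectors of the in-vocabulary words of a title."""
    vecs = [vocab[w] for w in title.split() if w in vocab]
    return np.mean(vecs, axis=0)

X = np.vstack([embed(t) for t in titles])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(titles, labels)))
```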

16:05
Comparison of different “window-size” key phrase co-occurrence for knowledge representation
SPEAKER: unknown

ABSTRACT. Word/phrase co-occurrence (if two words/phrases p and p’ are seen within a certain window, they are usually related) is a basic method to find word/phrase associations for various bibliometric or informetric analyses, such as semantic association, topic identification, theme clustering, and knowledge structure profiling of a domain. To construct an effective word/phrase co-occurrence matrix, there are two important factors: word/phrase selection and identification of a suitable co-occurrence window size. In this paper, the authors focus on identifying the most effective co-occurrence window size for knowledge representation by comparing five different window definitions: full text/paragraph, sentence-wise, fixed window size, syntactic relationship, and semantic relationship.
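
A minimal sketch of the fixed-window variant, one of the five definitions compared; sentence-wise counting would reset the window at sentence boundaries instead. Tokens and window size are illustrative.

```python
from collections import Counter

def cooccurrences(tokens, window=3):
    """Count pairs of tokens that appear within `window` positions."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pairs[tuple(sorted((w, tokens[j])))] += 1
    return pairs

tokens = "text mining supports topic identification in text corpora".split()
print(cooccurrences(tokens, window=3).most_common(3))
```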

15:25-16:25 Session 4B: TECHMINING CAPABILITIES PANEL DISCUSSION

This session will foster a more in-depth exploration of the capabilities within the text-mining field, following on the comments of the conference morning keynote "Democratizing Text: Fostering New Capabilities in Text Mining."

Panelists:

Rich Corken
Scott Cunningham
Diana Hicks
Ismael Rafols
Daniele Rotolo


Location: Green Cube Building
17:00-19:00 Session: COCKTAIL RECEPTION
Location: Purple Cube Building
17:00-19:00 Session P: POSTER SESSION
Location: Purple Cube Building
17:00
FutureTDM: IMPROVE THE UPTAKE OF TEXT AND DATA MINING IN THE EU

ABSTRACT. As the use of content mining is significantly lower in Europe than in some countries in the Americas and Asia, the FutureTDM project seeks to improve uptake of text and data mining (TDM) in the EU. FutureTDM actively engages with stakeholders such as researchers, developers, publishers and SMEs and looks in depth at the TDM landscape in the EU to help pinpoint why uptake is lower, to raise awareness of TDM and to develop solutions.

17:00
Measuring technology activities of innovators in the US Fortune 500
SPEAKER: unknown

ABSTRACT. This study is significant for understanding the state of the art of the technology activities of innovators in the US Fortune 500, and for further comprehending the relationship between company revenue and technology innovation. The distinctive innovation of this paper lies in measuring the technology activities of innovators in the US Fortune 500 by employing 3 indicators: patent count, the number of different technology classifications, and the total number of technology classifications. We further analyze the relationship between company revenue and technology innovation, and the relationship between technology innovation and the number of different technology classifications, by selecting another 2 indicators: the correlation coefficient between company revenue and patent count, and the correlation coefficient between patent count and the number of different technology classifications. Results show that some companies in industries such as Electronics & Electric, Aerospace & Defense, and Chemicals tend to have a higher technology innovation level with more patent filings; the breadth of technology impact of innovators, in terms of the number of different technology classifications, and the depth of technology impact, in terms of the total number of technology classifications, vary among companies; and Apple has a much higher correlation coefficient both between company revenue and patent count, and between patent count and the number of different technology classifications.

17:00
Do significant inventions involve more collaboration?
SPEAKER: Chunjuan Luan

ABSTRACT. This study explores whether significant inventions, defined as those with high citations, involve more collaboration than less-significant inventions with no citations. It also examines the dynamic evolutionary trends of collaboration for both significant inventions and less-significant ones during the period 1985-2010, and whether significant inventions involve more collaboration in each selected technology classification than less-significant ones. We employ patent co-inventor analysis, selecting patent data from the worldwide patent database Derwent Innovations Index (DII) as sample data, and choosing the number of inventors per patent (TNOIPP) and the proportion of patents with multiple inventors (TPOPWMI) as two indicators. The results reveal more collaboration on significant inventions than on less-significant inventions from an overview perspective. From 1985 to 2010, the dynamic evolutionary trends of collaboration measured by the two indicators, TNOIPP and TPOPWMI, rise sharply for significant inventions, whereas there are no such obvious changes for less-significant inventions; as far as each selected technology classification is concerned, significant inventions do involve more collaboration than less-significant ones. The results are of theoretical significance, for they reveal a positive quantitative relationship between collaboration and citation count for significant inventions, and may therefore aid in guiding research and development strategy.

17:00
New approach to the study of the impact of R&D&i Management System through the mining of the website.
SPEAKER: unknown

ABSTRACT. Understanding the impact of management innovation systems in a company remains a challenge due to the difficulty of recovering all the activities around the innovation process. Has the company increased its innovative activities after the certification process? We can document outputs of innovation such as increases in projects, increased collaborations, awards, and registered patents. This information can be obtained only from traditional sources. But we would like to find a way to detect the early deployment of innovation in a company; for this reason, we can use their business web pages to recognize and quantify these changes. The approach of this study is very new. We wish to study the innovation outputs of companies (before and after certification), using information from traditional sources and also mining their websites.

• Business databases: Sabi (Dun & Bradstreet, supplier for 2,000,000 Spanish companies and 500,000 Portuguese companies)

• Patent databases: Global Patent Index (GPI) database of the European Patent Office

• European project database

• From the websites we would like to recover information about changes in the organization before and after certification. To get the information from the web prior to certification, one must track the information stored in the Wayback Machine. For that task we are going to use two software tools: IBM Watson for scraping and Vantage Point for mining text from the websites.