Gender differs in how to say things. Age does in what to say.

Tracking #: 1633-2845

Seifeddine Mechti
Rim Faiz
Maher Jaoua
Francisco Rangel
Lamia Hadrich Belguith

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper

Abstract:
In this study, we present an original method for profiling the author of an anonymous English text. The aim of the proposed method is to determine the author's age and sex, especially for authors of user-generated content in social media. Machine learning methods were used in previous works to obtain the best classification. However, two important details were ignored in the proposed approaches: (1) in most cases, authors are classified according to their speeches and the expressions they use, but this classification does not show the type of features useful for each dimension (age, sex). Our study is based on the hypothesis that gender depends on the writing style, while age depends on the text content. (2) Methods using Bayesian networks did not yield the best results. Therefore, we propose a method relying on advanced Bayesian networks for age prediction based on content features and decision trees for gender detection based on stylistic features to overcome the previously mentioned problems. Our experimentation proved that gender differs in how to say things whereas age differs in what to say. Our method showed a high accuracy level by achieving one of the best results at the PAN@CLEF 2013 shared task: we obtained the second rank for gender prediction and the third rank for joint (age plus gender) identification.


Solicited Reviews:
Review #1
Anonymous submitted on 15/Jun/2017
Major Revision
Review Comment:


The authors perform classification on the age and gender of authors of text, and use both stylistic and content-based features to do so. These features both appear to be frequency counts of groups of words. However, the method for determining these word groups is not well described. The central thesis is difficult to disentangle from the text, but appears to be that different age groups can be distinguished by their topics of conversation, while gender is better differentiated by the way language is used. This is a very general claim, but the dataset and features used appear to be very context-specific and culturally embedded.

This paper is not of high enough quality to be considered for publication. There are numerous grammatical and technical errors. The related work is not sufficiently well described, nor is a meaningful picture of the state of the art drawn. The methodology is poorly described and doesn't appear to represent a technical advance. The evaluation section lacks numerous technical details, and the results are difficult to interpret. The findings are mostly unclear, and those claims that are clear are not well supported by the evidence presented.

Specific responses to the first few sections

# This reads a little oddly throughout. It is difficult to understand exactly what this paper is setting out to do or the value of this.

# The meaning of this sentence is unclear: in most cases, authors are classified according to their speeches and the expressions they use, but this classification does not show the type of features useful for each dimension (age, sex)
# What is meant by the 'type of features'? What is meant by 'classified according to their speeches'?

# The authors justify using Bayesian networks (advanced ones) by saying that in past approaches Bayesian networks performed poorly. There needs to be a better justification of why the authors pursued this approach.
Methods using the Bayesian networks did not yield the best results. Therefore, we propose a method relying on advanced Bayesian networks for age prediction based on content features and decision trees for gender detection based on stylistic features to overcome the previously mentioned problems

# This sentence is seriously grammatically flawed, and the meaning is obscured.
By analyzing their speeches and the expressions, we use can identify their characters...

# I am not sure that this sentence is meaningful, but it is certainly grammatically flawed.
Recently and with the development and widespread of social networks, such as Facebook, Twitter, etc, computer scientists has become increasingly concerned with determining the users’ identity,

# This sentence is very difficult to interpret too.
Concretely, this analysis may provide evidences to link words use to personality, social and situational fluctuations and psychological interventions.
# What are situational fluctuations? psychological interventions? How will an analysis of gender and age shed light on how these factors affect language use?

# This claim (and a number of others throughout the section) lacks a citation
For example, the particular use of some parts of speech, like pronouns, articles, prepositions, conjunctions, and auxiliary verbs, as well as their morphological variations, may serve as markers of emotional state, social identity and cognitive styles

Related work
# The meaning of this is unclear:
...74% of the discussions were highly ranked.
# This is a classification approach not a ranking approach, so in what sense are the conversations highly ranked.

# The distinction drawn from similar work only discusses the preprocessing, not the main technique. Even if the technique is the same, it should be clear that they differ only in how they preprocess. Do Marquardt et al. remove URLs, or apply case conversion or character removal, for instance? If so, then the stated distinction is not there; if they don't, then it should be made clear that they do not.
In their work, Marquardt et al. [14] used HTML Cleaning to obtain plain text and discrimination between human-like posts and spam-like posts. However Ashok's approach [2] is based on the deletion of URLs, hashtags and user's entries in Twitter. On the other hand, Weren [15] applied case conversion, invalid characters and multiple white spaces removal as well as tokenization and the selection of sub-corpus.

# The related work seems to just list the names of techniques used without any analysis of the value of techniques or of bringing understanding to the field as a whole, e.g.
In [21], automated readability measures, such as readability index, Coleman-Liau index, Rix readability index, Gunning Fog index and Flesch-Kinkaid index were employed.
# Employed in what setting, to establish what? Were they successful?
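For reference (not from the paper under review): these readability measures are closed-form formulas over surface counts. The Flesch-Kincaid grade level, for instance, is 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. A minimal sketch:

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level computed from raw surface counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# e.g., a 10-word, single-sentence passage with 13 syllables
grade = flesch_kincaid_grade(10, 1, 13)
```

In practice the hard part is counting syllables reliably; off-the-shelf implementations approximate it with vowel-group heuristics.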

# Claims without justification or support, e.g.
the major drawback of content-based attributes is that they depend on the psychological and mental states of the author (negative emotions, positive emotions) when writing, which might distort the classification results.
# Why might the results be distorted by using these attributes?

# page 3 left column ends abruptly and the sentence is not continued on the right column:
# left column ends:
Indeed they did not yield good results for author pro-
# right column starts( new paragraph):
In addition to the used style, the content of docu-
ments can be of great help in the classification
filing tasks

# page 3 right column section 3 first sentence ends abruptly and doesn't appear to be continued:
As described in Figure 1, our method consists of four
# then nothing.


# Authors should be specific about what they do. This is not explicit enough:
Preprocessing: The raw text was cleaned to remove noisy data like XML tags, Urls and hashtags.
# What specifically was removed.

# Here you appear to select the most frequent 200 words as content words:
We computed the number of occurrences of all the words in the corpus and ranked them in descending order of appearance. Afterwards, the top 200 were selected as content attributes.
# However, typically the top 200 words are likely to be function words, and/of/he/she. This doesn't make sense.
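The reviewer's concern is easy to reproduce: ranking raw token frequencies over almost any English text puts function words at the top, so selecting content attributes normally requires stopword filtering or a weighting scheme such as TF-IDF. A toy illustration (hypothetical text, not the PAN corpus):

```python
from collections import Counter

text = ("the cat sat on the mat and the dog lay by the door "
        "while the cat watched the dog")
freq = Counter(text.split())
# raw frequency ranking: the function word "the" dominates
top_raw = [w for w, _ in freq.most_common(3)]

# after removing function words, content words surface
stopwords = {"the", "on", "and", "by", "while"}
top_content = [w for w, _ in Counter(
    w for w in text.split() if w not in stopwords).most_common(3)]
```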

# Repeating a deterministic procedure a number of times should give exactly the same results. However:
Feature set generation: We repeated the last step six times and grouped the attributes belonging to the same class (age group or gender) together
# I think the authors intend to say that they perform the procedure once per class, but they do not state what the classes are at this stage. This suggests that there are 3 age classes, which seems very coarse.

# Again there is a lack of clarity:
Classifier: We used the data mining tool Weka 2 to build our classifiers. We also employed decision trees (J48) and the advanced Bayesian network to predict the gender and the age dimension respectively.
# Weka is a general data mining tool-kit. Which classifiers did the authors use Weka for? Not decision trees, and not their 'advanced' Bayesian network. At this stage there is still no discussion as to what is meant by an advanced Bayesian network.

# Ignoring the grammatical concerns, the following sentence suggests that the Figure is to explain decision trees in general. The figure seems to be more about describing how the authors used decision trees to classify the style of text documents.
Tree-view diagram shown in Figure 2 may help us to better understand decision trees.

### At this stage, and in light of the serious failings throughout, I have stopped making specific comments and I simply discuss the contribution overall.

Performance isn't as good as that of existing techniques. What their approach claims to add is an understanding of which features are good at determining age or gender. However, if their approach isn't as good as others, how can they convincingly claim that?

The approach they use to determine the baseline accuracy is not described at all.

The authors appear to be saying in Section 2 that emotion based words are not reliable because they may be influenced by context (although this is not made explicit). However, they then use content words to differentiate age group. One would argue that content words are as dependent as sentiment words on context (if not more so).

On page 6, the authors finally discuss what they mean by advanced bayesian networks. This appears to be the use of additional dependencies in the network. However, the authors only really describe the settings they use in a toolbox (e.g. greedy search) without explaining the underlying theoretical meaning of the approach. This focus on settings in an existing toolbox, without a meaningful description of what is being done or why, also calls into question the research value that is being added by this paper.
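For readers unfamiliar with the distinction the reviewer is drawing: a naive Bayes classifier conditions every feature only on the class, while a Bayesian network classifier (e.g., Weka's BayesNet with a greedy hill-climbing structure search) additionally learns edges between feature nodes. A toy naive-Bayes sketch over content words, using hypothetical data and purely to fix the baseline idea; an "advanced" variant would relax the independence assumption marked in the comment below:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (word_list, label) pairs -> (priors, counts, vocab)."""
    labels = Counter(label for _, label in examples)
    counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        counts[label].update(words)
        vocab.update(words)
    priors = {l: n / len(examples) for l, n in labels.items()}
    return priors, counts, vocab

def predict_nb(model, words):
    priors, counts, vocab = model
    def log_score(label):
        n = sum(counts[label].values())
        # naive independence assumption: each word depends only on the class;
        # a learned Bayesian network structure would add word-to-word edges
        return math.log(priors[label]) + sum(
            math.log((counts[label][w] + 1) / (n + len(vocab)))  # Laplace smoothing
            for w in words)
    return max(priors, key=log_score)

# hypothetical toy data: content words split by age group
model = train_nb([(["school", "homework"], "10s"),
                  (["mortgage", "job"], "30s")])
```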

Figure 2 is not good enough quality to read. Figures 3, 4, 5, 6, 7 and 8 do not have sufficient information to interpret them, e.g. a scale on the y-axis (and even on the x-axis in Figure 8). Figures 6 and 7 should really have error bars on them. Figure 8 should really give an idea of uncertainty too.

In general, methodological descriptions are too unclear, and not explicit enough, for these experiments to be repeated, which should be considered the minimum descriptive requirement for such experiments.

There are inconsistencies in terms of descriptions throughout. For instance, the age groups are at places described as 10s, 20s and 30s, and later as 13-17, 23-27 and 33-47. I am not sure that people in the age group (33-47) should be described as elderly.

Finally, the paper doesn't really make its thesis clear, justify it convincingly with results, or reflect meaningfully on how those results support it.

Review #2
By Panagiotis Papadakos submitted on 05/Sep/2017
Review Comment:

This paper presents a method for profiling the author of an anonymous English text (i.e., their age and gender). The authors claim that gender differs in writing style while age depends on the text content. In this direction, they propose a method that is based on advanced Bayesian networks over content features for age and decision trees for gender detection.

Although I find the claim that gender differs in stylistic-based features and age in content-based features interesting, unfortunately the authors fail to convince me of the correctness of their approach and the validity of their results.

Generally, it is difficult to follow the applied approach as described in Section 3. For example, I can't understand the feature set extraction step. What is the last step that is repeated six times, and why is it repeated six times? How are the attributes grouped together? Why is nation a stylistic-based feature and not a content-based one? Over which data have you built your model? A lot of these decisions seem ad hoc and should be described and supported in detail.

My major concern with this paper though is its contribution. It describes a method that showed competitive results on the Spanish partition of PAN-AP-2013, but this was 5 years ago! The authors should apply their approach to the new 5th Author Profiling Task at PAN 2017 and compare their results with current systems if they want to be competitive. The only improvement over the initial 2013 system is the utilization of advanced Bayesian networks for age, which shows a slight improvement over the original decision tree approach.

Some other comments.

1) Figure 2 is unreadable.

2) Table 3 and Figure 3 describe the same thing.

3) There are a lot of typos and styling problems (lines that appear in the next column of the paper, like at the top of the second page, second column).

Review #3
Anonymous submitted on 07/Oct/2017
Review Comment:

The paper presents an approach based on a decision tree model developed for PAN@CLEF2013. The approach aims to detect the gender and age of the authors of pieces of text. The authors regard the approach as innovative by virtue of the features that they rely on, in particular the use of writing style for gender and text content for age. After an introduction of the matter at hand, the authors present a rather extensive state of the art. Thereafter, they present their 4-step method. The preprocessing and text analysis steps are clear, but the feature generation remains unclear. In particular, expressions such as "We repeated the last step" make this crucial step difficult to follow. Which last step do you mean? The text analysis? Feature set generation is step 3 and can hence not be the last. Please rewrite this section.

The figures that are meant to drive the analysis of the feature selection and results are partly rather difficult to read. In particular, Fig. 2 is close to impossible to decipher, and Fig. 4 should be created using vector graphics (you can generate them using freeware such as R by simply exporting the graphics as PDFs). After the evaluation, it becomes clear that the authors basically took their PAN work and applied a different type of machine learning from Weka to the features they had selected afore. Overall, the degree of originality is rather minimal. Moreover, the paper demands quite a bit of polishing before it is ready for publication (see minor comments below). The significance of the results is not particularly high, given that the authors have already worked on these features and the combination with another classifier does not achieve significantly better results than the state of the art.

Most importantly though, the paper is simply not relevant to the special issue. There is no relation to the topics of interest, which state: "For this special issue, we welcome articles presenting (1) novel benchmarks (including benchmarking results) as well as (2) novel insights pertaining to evaluating any of the steps of the Linked Data lifecycle. More specifically, we are interested in articles presenting benchmarks across the entire Linked Data life cycle, benchmarks that rely on large datasets and insights obtained from benchmarking Linked Data management processes."
I am hence afraid I have to reject this paper. The authors might want to consider a regular submission to this journal while putting an emphasis on the Semantic Web portion of their results, which I currently fail to see.

Minor comments
we use can => we can
discussed by an old person => by old persons?
by a teenager => by teenagers?
scientists has => have
words use to => words used to
in the marketing => from a marketing
works focused => works have focused
such as his/her => such as an author's
Formatting error right column of page 2
data base => database
Fig. 1: Please align the boxes (e.g., age, gender) and use the same font as the text's font
URL vs. Url => please be consistent (URL preferred)
Table 3 vs. Table III => please be consistent (use 3)
Formatting error => Large gap after table 4, please fix


Review #4
Minor Revision
Review Comment:

The paper presents a solution to the well-known and well-studied problem of author profiling, with the aim of predicting authors' gender and age group by analyzing the text composed by them. Experiments are run using an available dataset. Overall, this is an interesting study; however, there are some issues that I would like to see addressed.
Major concerns:

- What do the authors mean by "good classification rate of 0.58"? Is it accuracy, precision, or recall?
- Missing feature significance analysis: it is important to know which features help the classifiers. It is standard practice in a classification setting to carry out a feature significance test to check the effectiveness of features.

The evaluation should report precision, recall, and F-score consistently for the baseline and the proposed system. I suggest the authors go through some standard machine learning / NLP papers that solve classification problems to learn how results are reported.
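For completeness, the measures this reviewer asks for are standard: precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is their harmonic mean. A minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Reporting all three, for both the baseline and the proposed system and per class, is what the reviewer is asking for.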