To receive RDX data, researchers who are approved for a research project must work with their institution’s data czar. Before delivering data through this program, edX works to obscure personally identifying information (PII) by obfuscating obvious identifiers. EdX executes the obfuscation process on every XBlock in the course, including the course itself.
The following topics describe the data obfuscation procedures.
To obfuscate PII and other sensitive information in RDX data packages, edX uses these methods to modify data.
EdX remaps SQL table column values that are foreign keys, such as the user_id and username values in the auth_userprofile table. This process applies a format-preserving algorithm to the affected data values across all table columns and event member fields. As a result, scripts and software designed to process these values in institution-specific data packages should continue to function as expected for an RDX data package.
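The remapping idea can be illustrated with a minimal sketch. The actual algorithm and key handling that edX uses are not described here, so everything below is an assumption: the sketch swaps in a keyed HMAC to derive a stable integer from each ID, and the names `SECRET_KEY`, `remap_user_id`, and `remap_username` are hypothetical. The username format follows the "username_{remapped id value}" convention listed for the auth_user table.

```python
import hashlib
import hmac

# Hypothetical per-project secret; real key management is not described in the source.
SECRET_KEY = b"example-project-secret"


def remap_user_id(user_id: int, key: bytes = SECRET_KEY) -> int:
    """Deterministically map an integer ID to another non-negative integer.

    The same input always yields the same output, so foreign-key
    relationships across tables and events stay consistent.
    """
    digest = hmac.new(key, str(user_id).encode("ascii"), hashlib.sha256).digest()
    # Keep 31 bits so the result still fits a signed 32-bit integer column.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


def remap_username(user_id: int, key: bytes = SECRET_KEY) -> str:
    """Build a replacement username in the 'username_{remapped id}' format."""
    return f"username_{remap_user_id(user_id, key)}"
```

Because the output is still an integer (or a plain string for usernames), code that joins tables on these values keeps working. A truncated HMAC is not a true one-to-one mapping, so a production process would also need to guard against collisions.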
Data removal effectively deletes a data value. This process replaces a value with NULL, a zero-length string, or zeros, based on the data type of the column or field.
EdX applies this method to files and columns that contain sensitive data or PII that cannot be remapped to a usable value. Examples include the first_name and last_name columns in the auth_userprofile table and wiki_articlerevision.ip_address.
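As a rough illustration of data removal, the following sketch picks a replacement based on the Python type of each value. The mapping of types to replacements (strings to a zero-length string, numbers to zero, anything else to NULL) is an assumption, and `remove_value` and `remove_columns` are hypothetical helper names.

```python
from typing import Any, Dict, Set


def remove_value(value: Any) -> Any:
    """Return the 'removed' form of a value, chosen by its data type."""
    if isinstance(value, str):
        return ""  # zero-length string
    if isinstance(value, (int, float)):
        return type(value)(0)  # 0 for int, 0.0 for float
    return None  # NULL for everything else


def remove_columns(row: Dict[str, Any], columns: Set[str]) -> Dict[str, Any]:
    """Apply removal to the named columns and leave other columns untouched."""
    return {
        name: remove_value(value) if name in columns else value
        for name, value in row.items()
    }
```

For example, `remove_columns(row, {"first_name", "last_name"})` returns a copy of the row in which both name columns are empty strings.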
The replacement method identifies text strings that have characteristics of PII. Identified values are replaced by token values. EdX typically applies this method to free text, including discussion posts and wiki articles.
The replacement method identifies and replaces the following values.
Email addresses in {name}@{destination}.{domain} format. All of the characters in this format must be ASCII.
Telephone numbers in common European and U.S. formats.
Usernames that match the value in auth_user.username, with the exception of usernames that begin or end with a punctuation mark such as an underscore or hyphen.
The system identifies and replaces only the username of the person who is associated with the event or row. For example, if a learner identifies a wiki contribution with her username, the system replaces that username if it matches the value in auth_user.username.
Names that, after punctuation is removed, match any whitespace-delimited words of three or more characters in auth_userprofile.name.
The system identifies and replaces only the name of the person who is associated with the event or row. For example, if a learner introduces himself on a course discussion page, the system replaces the name if it matches a word in auth_userprofile.name.
EdX uses tokens that identify the category of the information that was replaced, including <<EMAIL>> and <<PHONE_NUMBER>>. As a result, the general meaning of the original text can be inferred from the token.
For example, a learner adds this post to a course discussion.
Hi all,
My name is Jonathan M. Doe (johndoe), and I'm excited to be in this
class. Looking forward to connecting with everyone.
My email is [email protected], or you can call me at (123)321-1234.
Thanks,
-Jonathan
Assuming that the learner has values of “johndoe” in auth_user.username and “Jonathan Doe” in auth_userprofile.name, the CommentThread.body in the RDX package appears as follows.
Hi all,
My name is <<FULLNAME>> M. <<FULLNAME>> (<<USERNAME>>), and I'm excited to
be in this class. Looking forward to connecting with everyone.
My email is <<EMAIL>>, or you can call me at <<PHONE_NUMBER>>.
Thanks,
-<<FULLNAME>>
Another example follows, in which the learner’s post does not include any values that the replacement method identifies or replaces. In this case, the CommentThread.body in the RDX package is identical to the original post.
Hi everyone! My name is John, here's my info if you want to contact me!
Email: johnmdoe (AT) gmail (DOT) com
Twitter: @jmdoe
Mobile: 1233211234
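The replacement rules described above can be approximated with a few regular expressions. This is a sketch under stated assumptions, not the patterns that edX actually uses: `EMAIL_RE` and `PHONE_RE` are hypothetical and deliberately narrow (the phone pattern covers only one U.S.-style layout), matching is assumed to be case-sensitive, and `replace_pii` is a hypothetical function name. The token names come from the examples in this topic.

```python
import re
import string

# Hypothetical patterns; the real ones are not given in the source.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"(?<!\w)\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}(?!\w)")


def replace_pii(text: str, username: str, full_name: str) -> str:
    """Replace emails, phone numbers, and one learner's username and name."""
    text = EMAIL_RE.sub("<<EMAIL>>", text)
    text = PHONE_RE.sub("<<PHONE_NUMBER>>", text)

    # Skip usernames that begin or end with punctuation.
    if username and username[0].isalnum() and username[-1].isalnum():
        text = re.sub(rf"\b{re.escape(username)}\b", "<<USERNAME>>", text)

    # Strip punctuation from the profile name, then match each
    # whitespace-delimited word of three or more characters.
    cleaned = full_name.translate(str.maketrans("", "", string.punctuation))
    for word in cleaned.split():
        if len(word) >= 3:
            text = re.sub(rf"\b{re.escape(word)}\b", "<<FULLNAME>>", text)
    return text
```

Called with the username "johndoe" and the name "Jonathan Doe", this sketch leaves the second example post unchanged, because "John", the spelled-out email address, the Twitter handle, and the unformatted digit string match none of the patterns.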
course_structure File
An {org}-{course}-{date}-course_structure-{site}-analytics.json file is provided for each course. Its metadata member field stores course settings that can contain sensitive data such as the password used to authenticate a third party service for the course.
Before packaging this file for an RDX research project, edX obfuscates data as follows.
EdX includes only an approved subset of the settings that can be defined in the metadata object.
EdX removes all of the other fields found in this object.
A new field, redacted_metadata, is added to the file. This JSON array lists all of the fields that edX removed.
An example follows.
{
"category": "course",
"metadata": {...},
"redacted_metadata": ["field_name"]
}
For more information about this file, see Course Content Data.
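The allowlist procedure for the metadata object can be sketched as follows. The approved subset of settings is not listed in this topic, so `APPROVED_METADATA` is a hypothetical placeholder, as are the function name `redact_metadata` and the choice to sort the removed field names.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical allowlist; the real approved subset is not listed in the source.
APPROVED_METADATA: FrozenSet[str] = frozenset({"display_name", "start", "end"})


def redact_metadata(
    block: Dict[str, Any], approved: FrozenSet[str] = APPROVED_METADATA
) -> Dict[str, Any]:
    """Keep approved metadata fields and record the names of removed ones."""
    metadata = block.get("metadata", {})
    result = dict(block)  # shallow copy so the input block is not modified
    result["metadata"] = {k: v for k, v in metadata.items() if k in approved}
    result["redacted_metadata"] = sorted(k for k in metadata if k not in approved)
    return result
```

An allowlist fails safe: a setting that is added to the platform later is removed by default until someone explicitly approves it.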
course File
An {org}-{course}-course-{site}-analytics.xml.tar.gz file is provided for each course. This compressed file contains exports of all of a course’s content in a set of JSON and XML files.
Note
EdX does not support imports of course files that contain obfuscated data.
policy.json File
Like the course_structure file, the policy.json file found in the course file stores course-level settings. To remove data from this file, edX uses the same procedure described for Obfuscated Data in the course_structure File.
Before processing, the policy.json file lists settings by name with their specified values.
{
"course/course_name": {
"setting_name": "some_value"
}
}
After processing, the redacted_attributes array is added to list any field names removed by the obfuscation process.
{
"course/course_name": {
"redacted_attributes": ["setting_name"]
}
}
For more information about this file, see Course Policies.
The XML files found in the course file store data for course content. The files for course components can contain sensitive data that is defined at the component level, such as passwords for third party services.
Before packaging the XML files for an RDX research project, edX obfuscates data as follows.
The redacted_attributes attribute lists all of the attributes that edX removed.
The redacted_children attribute lists all of the child nodes that edX removed.
For more information about this file, see Course Components (XBlocks).
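A comparable sketch for the XML files follows, assuming an allowlist approach like the one described for the metadata object. The allowlists `APPROVED_ATTRIBUTES` and `APPROVED_CHILDREN`, the comma-separated format of the two added attributes, and the function name `redact_element` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlists; the real approved sets are not listed in the source.
APPROVED_ATTRIBUTES = {"display_name", "url_name"}
APPROVED_CHILDREN = {"vertical", "html", "problem", "video"}


def redact_element(elem: ET.Element) -> ET.Element:
    """Strip unapproved attributes and child nodes in place, recording each."""
    removed_attrs = sorted(a for a in elem.attrib if a not in APPROVED_ATTRIBUTES)
    for name in removed_attrs:
        del elem.attrib[name]

    removed_children = []
    for child in list(elem):  # copy the list because children are removed
        if child.tag in APPROVED_CHILDREN:
            redact_element(child)
        else:
            removed_children.append(child.tag)
            elem.remove(child)

    # Assumed format: comma-separated names in a single attribute value.
    if removed_attrs:
        elem.set("redacted_attributes", ",".join(removed_attrs))
    if removed_children:
        elem.set("redacted_children", ",".join(removed_children))
    return elem
```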
auth_user Table
auth_userprofile Table
student_courseenrollment Table
user_api_usercoursetag Table
teams_courseteammembership Table
courseware_studentmodule Table
certificates_generatedcertificate Table
verify_student_verificationstatus Table
wiki_article Table
wiki_articlerevision Table
auth_user Table
The following table lists the columns in the auth_user table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the auth_user Table.
Column | Method
---|---
id | Remap
username | Remap. The result is in the format “username_{remapped id value}”.
first_name | Remove
last_name | Remove
 | Remove
password | Remove
status | Remove
email_key | Remove
avatar_type | Remove
country | Remove
show_country | Remove
date_of_birth | Remove
interesting_tags | Remove
ignored_tags | Remove
email_tag_filter_strategy | Remove
display_tag_filter_strategy | Remove
consecutive_days_visit_count | Remove
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
auth_userprofile Table
The following table lists the columns in the auth_userprofile table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the auth_userprofile Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
name | Remove
language | Remove
location | Remove
meta | Remove
courseware | Remove
mailing_address | Remove
city | Remove
bio | Remove
The self-reported, optional values for gender, year_of_birth, level_of_education, goals, and country are not obfuscated.
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
student_courseenrollment Table
The following table lists the columns in the student_courseenrollment table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the student_courseenrollment Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
user_api_usercoursetag Table
The following table lists the columns in the user_api_usercoursetag table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the user_api_usercoursetag Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
teams_courseteammembership Table
The following table lists the columns in the teams_courseteammembership table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the teams_courseteammembership Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
courseware_studentmodule Table
The following table lists the columns in the courseware_studentmodule table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the courseware_studentmodule Table.
Column | Method
---|---
student_id | Remap (same as auth_user.id)
state | Replace
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
certificates_generatedcertificate Table
The following table lists the columns in the certificates_generatedcertificate table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the certificates_generatedcertificate Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
download_url | Remove
verify_uuid | Remove
download_uuid | Remove
name | Remove
error_reason | Remove
key | Remove
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
verify_student_verificationstatus Table
The following table lists the columns in the verify_student_verificationstatus table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Columns in the verify_student_verificationstatus Table.
Column | Method
---|---
user_id | Remap (same as auth_user.id)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
wiki_article Table
The following table lists the columns in the wiki_article table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Fields in the wiki_article File.
Column | Method
---|---
owner_id | Remove
group_id | Remove
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
wiki_articlerevision Table
The following table lists the columns in the wiki_articlerevision table that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about this table, see Fields in the wiki_articlerevision File.
Column | Method
---|---
automatic_log | Remove
content | Replace
ip_address | Remove
user_id | Remap (same as auth_user.id)
user_message | Remove
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
teams_courseteam SQL Table
The obfuscation process does not change any of the values in the teams_courseteam table. For more information about this table, see Columns in the teams_courseteam Table.
The following data is not included in RDX packages.
The {org}-email_opt_in-{site}-analytics.csv file. For more information about this report, see Institution-wide Data.
The user_id_map SQL table. For more information about this table, see Columns in the user_id_map Table.
The following table lists fields in the CommentThread and Comment JSON documents that can contain PII and the obfuscation method that edX applies before the data is included in an RDX package. For more information about these documents, see Discussion Forums Data.
Field | Method | Found in
---|---|---
author_id | Remap (same as auth_user.id) | CommentThread and Comment
title | Replace | CommentThread only
author_username | Remap (same as auth_user.username) | CommentThread and Comment
body | Replace | CommentThread and Comment
votes | Remap (same as auth_user.id) | CommentThread and Comment
endorsement | Remap (same as auth_user.id) | CommentThread only
abuse_flaggers | Remap (same as auth_user.id) | CommentThread only
historical_abuse_flaggers | Remap (same as auth_user.id) | CommentThread only
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
RDX data packages include both explicit events and implicit events.
The following topics list the implicit events that are included in RDX data packages.
Note
Implicit events are not instrumented by edX and are subject to change at any time.
The following table lists fields that can contain PII and that are common to all events, with the obfuscation method that edX applies before the data is included in an RDX package.
The common context field, which can contain event-specific member fields, is described in the context member fields topic. For more information about the fields that are common to all events, see Common Fields.
Common Field | Method
---|---
host | Remove
ip | Remove
page | Remove
referer | Remove
username | Remap
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
context Member Fields
The following table lists member fields of the context field that are obfuscated when present for any event.
context Member Field | Method
---|---
client | Remove device or ip if present
host | Remove
ip | Remove
path | Remove
user_id | Remap
username | Remap (same as auth_user.username)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
event Member Fields
The following table lists member fields of the event field that are obfuscated when present for any event.
Note
To search for string values to replace, the obfuscation process recursively traverses the entire event data structure. This table lists event member fields that typically include data that is removed or remapped. Additional event member fields can also include data that is replaced.
event Member Field | Method
---|---
answer.file_upload_key | Remove
certificate_id | Remove
certificate_url | Remove
fileName | Remove
GET | Remove
instructor | Remap (same as auth_user.username)
POST | Remove
report_url | Remove
requesting_student_id | Remove
saved_response.file_upload_key | Remove
student | Remap (same as auth_user.username)
source_url | Remove
url | Remove
url_name | Remove
user | Remap (same as auth_user.username)
user_id | Remap (same as auth_user.id)
username | Remap (same as auth_user.username)
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
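The recursive traversal of the event data structure might look like the following sketch. The field sets are copied from the table above. The function name `obfuscate_event`, the three callables it takes (`remap_id`, `remap_username`, `replace_text`), and the use of None for every removed field are assumptions made to keep the example short.

```python
from typing import Any, Callable

REMOVE_FIELDS = {
    "answer.file_upload_key", "certificate_id", "certificate_url", "fileName",
    "GET", "POST", "report_url", "requesting_student_id",
    "saved_response.file_upload_key", "source_url", "url", "url_name",
}
REMAP_USERNAME_FIELDS = {"instructor", "student", "user", "username"}
REMAP_ID_FIELDS = {"user_id"}


def obfuscate_event(
    node: Any,
    remap_id: Callable[[Any], Any],
    remap_username: Callable[[Any], Any],
    replace_text: Callable[[str], str],
    parent: str = "",
) -> Any:
    """Walk dicts and lists recursively, obfuscating fields by name."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            dotted = f"{parent}.{key}" if parent else key
            if key in REMOVE_FIELDS or dotted in REMOVE_FIELDS:
                out[key] = None  # simplification: always NULL
            elif key in REMAP_USERNAME_FIELDS:
                out[key] = remap_username(value)
            elif key in REMAP_ID_FIELDS:
                out[key] = remap_id(value)
            else:
                out[key] = obfuscate_event(
                    value, remap_id, remap_username, replace_text, key
                )
        return out
    if isinstance(node, list):
        return [
            obfuscate_event(v, remap_id, remap_username, replace_text, parent)
            for v in node
        ]
    if isinstance(node, str):
        return replace_text(node)  # free text anywhere in the structure
    return node
```

Passing the three operations in as callables keeps the traversal independent of how remapping and replacement are implemented.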