To receive RDX data, researchers who are approved for a research project must work with their institution’s data czar. Before delivering data through this program, edX works to obscure personally identifying information (PII) by obfuscating obvious identifiers. EdX executes the obfuscation process on every XBlock in the course, including the course itself.
The following topics describe the data obfuscation procedures.
To obfuscate PII and other sensitive information in RDX data packages, edX uses these methods to modify data.
EdX remaps SQL table column values that are foreign keys, such as the
user_id and username values in the auth_userprofile table. This
process applies a format-preserving algorithm to the affected data values
across all table columns and event member fields. As a result, scripts and
software designed to process these values in institution-specific data packages
should continue to function as expected for an RDX data package.
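The source does not describe the remapping algorithm itself. As a minimal sketch, assuming a keyed hash stands in for the format-preserving step, the following Python shows the property that matters to downstream scripts: the same input always yields the same output of the same type, so foreign keys still join across tables. The names SECRET_KEY, remap_user_id, and remap_username are hypothetical. The username shape follows the format given for auth_user.username.

```python
import hashlib
import hmac

# Hypothetical secret; a real pipeline would keep its own key private.
SECRET_KEY = b"example-key-for-illustration-only"


def remap_user_id(user_id: int) -> int:
    """Deterministically map an integer ID to another non-negative integer."""
    message = str(user_id).encode("ascii")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    # Keep 31 bits so the result still fits a signed 32-bit integer column.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


def remap_username(user_id: int) -> str:
    """Build a replacement username from the remapped ID."""
    return f"username_{remap_user_id(user_id)}"


# The same learner ID in two tables remaps to the same value, so joins still work.
profile_row = {"user_id": remap_user_id(42)}
enrollment_row = {"user_id": remap_user_id(42)}
print(profile_row["user_id"] == enrollment_row["user_id"])
```

A truncated hash can map two different IDs to the same value, so this sketch illustrates consistency only and is not a collision-free mapping.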
Data removal effectively deletes a data value. This process replaces a value with NULL, a zero-length string, or zeros, based on the data type of the column or field.
EdX applies this method to files and columns that contain sensitive data or PII
that cannot be remapped to a usable value. Examples include the first_name
and last_name columns in the auth_userprofile table and
wiki_articlerevision.ip_address.
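As a rough illustration of removal by data type, the sketch below picks a replacement for a single value. The mapping of Python types to NULL, an empty string, or zero is an assumption made for the example, and removed_value is a hypothetical helper name.

```python
from typing import Any


def removed_value(value: Any) -> Any:
    """Return the 'removed' form of a value, chosen by its data type."""
    if value is None:
        return None  # NULL stays NULL
    if isinstance(value, str):
        return ""  # zero-length string for text columns
    if isinstance(value, int):
        return 0  # zero for integer columns
    if isinstance(value, float):
        return 0.0  # zero for floating-point columns
    return None  # assumption: any other type becomes NULL


row = {"first_name": "Jonathan", "last_name": "Doe", "login_count": 17}
cleaned = {column: removed_value(value) for column, value in row.items()}
print(cleaned)
```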
The replacement method identifies text strings that have characteristics of PII. Identified values are replaced by token values. EdX typically applies this method to free text, including discussion posts and wiki articles.
The replacement method identifies and replaces the following values.
Email addresses in {name}@{destination}.{domain} format. All of the characters in this format must be ASCII.
Telephone numbers in common European and U.S. formats.
Usernames that match the value in auth_user.username, with the exception
of usernames that begin or end with a punctuation mark such as an underscore
or hyphen.
The system identifies and replaces only the username of the person who is
associated with the event or row. For example, if a learner identifies a wiki
contribution with her username, the system replaces that username if it
matches the value in auth_user.username.
Names that, after punctuation is removed, match any whitespace-delimited
words of three or more characters in auth_userprofile.name.
The system identifies and replaces only the name of the person who is
associated with the event or row. For example, if a learner introduces
himself on a course discussion page, the system replaces the name if it
matches a word in auth_userprofile.name.
EdX uses tokens that identify the category of the information that was
replaced, including <<EMAIL>> and <<PHONE_NUMBER>>. As a result, the
general meaning of the original text can be inferred from the token.
For example, a learner adds this post to a course discussion.
Hi all,
My name is Jonathan M. Doe (johndoe), and I'm excited to be in this
class. Looking forward to connecting with everyone.
My email is [email protected], or you can call me at (123)321-1234.
Thanks,
-Jonathan
Assuming that the learner has values of “johndoe” in auth_user.username and
“Jonathan Doe” in auth_userprofile.name, the CommentThread.body in
the RDX package appears as follows.
Hi all,
My name is <<FULLNAME>> M. <<FULLNAME>> (<<USERNAME>>), and I'm excited to
be in this class. Looking forward to connecting with everyone.
My email is <<EMAIL>>, or you can call me at <<PHONE_NUMBER>>.
Thanks,
-<<FULLNAME>>
Another example follows, in which the learner’s post does not include any
values that the replacement method identifies or replaces. In this case, the
CommentThread.body in the RDX package is identical to the original post.
Hi everyone! My name is John, here's my info if you want to contact me!
Email: johnmdoe (AT) gmail (DOT) com
Twitter: @jmdoe
Mobile: 1233211234
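The replacement rules listed above can be sketched as a short Python function. This is an illustration under stated assumptions, not the edX implementation: the regular expressions cover only simple ASCII email addresses and two U.S.-style phone layouts, matching is case-sensitive, and the names EMAIL_RE, PHONE_RE, and replace_pii, as well as the jdoe@example.com address, are hypothetical.

```python
import re
import string

# Assumed patterns; the source does not publish the exact expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")


def replace_pii(text: str, username: str, full_name: str) -> str:
    """Replace email, phone, username, and name words with category tokens."""
    text = EMAIL_RE.sub("<<EMAIL>>", text)
    text = PHONE_RE.sub("<<PHONE_NUMBER>>", text)
    # Usernames that begin or end with punctuation are skipped.
    punct = string.punctuation
    if username and username[0] not in punct and username[-1] not in punct:
        text = re.sub(rf"\b{re.escape(username)}\b", "<<USERNAME>>", text)
    # Strip punctuation from the profile name, then match words of 3+ characters.
    cleaned = full_name.translate(str.maketrans("", "", punct))
    for word in cleaned.split():
        if len(word) >= 3:
            text = re.sub(rf"\b{re.escape(word)}\b", "<<FULLNAME>>", text)
    return text


post = (
    "My name is Jonathan M. Doe (johndoe). "
    "Email jdoe@example.com or call (123)321-1234.\n-Jonathan"
)
print(replace_pii(post, username="johndoe", full_name="Jonathan Doe"))
```

With these patterns, text such as "johnmdoe (AT) gmail (DOT) com" or an unformatted digit string matches nothing and passes through unchanged, which mirrors the second example above.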
course_structure File
An {org}-{course}-{date}-course_structure-{site}-analytics.json file is
provided for each course. Its metadata member field stores course settings
that can contain sensitive data, such as the password used to authenticate a
third-party service for the course.
Before packaging this file for an RDX research project, edX obfuscates data as follows.
EdX includes only an approved subset of the settings that can be defined in
the metadata object.
EdX removes all of the other fields found in this object.
A new field, redacted_metadata, is added to the file. This JSON array
lists all of the fields that edX removed.
An example follows.
{
"category": "course",
"metadata": {...},
"redacted_metadata": ["field_name"]
}
For more information about this file, see Course Content Data.
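The steps above amount to an allowlist filter over the metadata object. The following Python is a sketch under that reading. APPROVED_SETTINGS, its contents, the service_password setting, and redact_metadata are hypothetical names, because the source does not enumerate the approved subset.

```python
from typing import Any, Dict, Set

# Hypothetical allowlist; the real approved subset is not listed in the source.
APPROVED_SETTINGS: Set[str] = {"display_name", "start"}


def redact_metadata(block: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one course_structure entry with unapproved settings removed."""
    metadata = block.get("metadata", {})
    kept = {key: value for key, value in metadata.items() if key in APPROVED_SETTINGS}
    removed = sorted(key for key in metadata if key not in APPROVED_SETTINGS)
    result = dict(block)
    result["metadata"] = kept
    # Record the names of the removed fields, never their values.
    result["redacted_metadata"] = removed
    return result


block = {
    "category": "course",
    "metadata": {"display_name": "Demo", "service_password": "s3cret"},
}
print(redact_metadata(block))
```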
course File
An {org}-{course}-course-{site}-analytics.xml.tar.gz file is provided for
each course. This compressed file contains exports of all of a course’s content
in a set of JSON and XML files.
Note
EdX does not support imports of course files that contain
obfuscated data.
policy.json File
Like the course_structure file, the policy.json file found in the
course file stores course-level settings. To remove data from this file,
edX uses the same procedure described for Obfuscated Data in the course_structure File.
Before processing, the policy.json file lists settings by name with their specified values.
{
"course/course_name": {
"setting_name": "some_value"
}
}
After processing, the redacted_attributes array is added to list any field
names removed by the obfuscation process.
{
"course/course_name": {
"redacted_attributes": ["setting_name"]
}
}
For more information about this file, see Course Policies.
The XML files found in the course file store data for course content. The
files for course components can contain sensitive data that is defined at the
component level, such as passwords for third-party services.
Before packaging the XML files for an RDX research project, edX obfuscates data as follows.
The redacted_attributes attribute lists all of the attributes that edX removed.
The redacted_children attribute lists all of the child nodes that edX removed.
For more information about these files, see Course Components (XBlocks).
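A sketch of how attribute and child-node redaction might look for one component file follows, using the Python standard library. The allowlists, the comma-separated value format of the two redacted_* attributes, and the redact_element name are all assumptions made for illustration. The source states only that the two attributes list what was removed.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Hypothetical allowlists; the source does not enumerate them.
APPROVED_ATTRIBUTES: Set[str] = {"display_name", "url_name"}
APPROVED_CHILDREN: Set[str] = {"vertical", "html"}


def redact_element(elem: ET.Element) -> None:
    """Redact one element in place, then recurse into the children that remain."""
    removed_attrs = sorted(name for name in elem.attrib if name not in APPROVED_ATTRIBUTES)
    for name in removed_attrs:
        del elem.attrib[name]
    removed_children: List[str] = []
    for child in list(elem):  # copy the child list because we remove while iterating
        if child.tag not in APPROVED_CHILDREN:
            removed_children.append(child.tag)
            elem.remove(child)
    for child in elem:
        redact_element(child)
    # Assumed format: comma-separated names of whatever was removed.
    if removed_attrs:
        elem.set("redacted_attributes", ",".join(removed_attrs))
    if removed_children:
        elem.set("redacted_children", ",".join(removed_children))


root = ET.fromstring(
    '<sequential display_name="Week 1" password="s3cret">'
    '<vertical url_name="v1"/><lti launch_url="x"/></sequential>'
)
redact_element(root)
print(ET.tostring(root, encoding="unicode"))
```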
auth_user Table
The following table lists the columns in the auth_user table that can
contain PII and the obfuscation method that edX applies before the data is
included in an RDX package. For more information about this table, see
Columns in the auth_user Table.
| Column | Method |
|---|---|
| id | Remap |
| username | Remap. The result is in the format “username_{remapped id value}”. |
| first_name | Remove |
| last_name | Remove |
| | Remove |
| password | Remove |
| status | Remove |
| email_key | Remove |
| avatar_type | Remove |
| country | Remove |
| show_country | Remove |
| date_of_birth | Remove |
| interesting_tags | Remove |
| ignored_tags | Remove |
| email_tag_filter_strategy | Remove |
| display_tag_filter_strategy | Remove |
| consecutive_days_visit_count | Remove |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
auth_userprofile Table
The following table lists the columns in the auth_userprofile table that
can contain PII and the obfuscation method that edX applies before the data is
included in an RDX package. For more information about this table, see
Columns in the auth_userprofile Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
| name | Remove |
| language | Remove |
| location | Remove |
| meta | Remove |
| courseware | Remove |
| mailing_address | Remove |
| city | Remove |
| bio | Remove |
The self-reported, optional values for gender, year_of_birth,
level_of_education, goals, and country are not obfuscated.
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
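One way to read the column tables is as a lookup from column name to method. The sketch below applies that idea to a single auth_userprofile row. PROFILE_METHODS transcribes the table above, while obfuscate_profile, the remap_id callback, and the stand-in lambda are hypothetical and do not reflect how edX implements the process.

```python
from typing import Any, Callable, Dict

# Column -> method, transcribed from the auth_userprofile table.
PROFILE_METHODS: Dict[str, str] = {
    "user_id": "remap",
    "name": "remove",
    "language": "remove",
    "location": "remove",
    "meta": "remove",
    "courseware": "remove",
    "mailing_address": "remove",
    "city": "remove",
    "bio": "remove",
}


def obfuscate_profile(row: Dict[str, Any], remap_id: Callable[[int], int]) -> Dict[str, Any]:
    """Apply the per-column method; unlisted columns pass through unchanged."""
    out: Dict[str, Any] = {}
    for column, value in row.items():
        method = PROFILE_METHODS.get(column)
        if method == "remap":
            out[column] = remap_id(value)
        elif method == "remove":
            out[column] = "" if isinstance(value, str) else None
        else:
            out[column] = value  # for example gender and year_of_birth
    return out


row = {"user_id": 7, "name": "Jonathan Doe", "city": "Boston",
       "gender": "m", "year_of_birth": 1990}
# Stand-in remap function for the example only.
print(obfuscate_profile(row, remap_id=lambda user_id: user_id + 1000))
```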
student_courseenrollment Table
The following table lists the columns in the student_courseenrollment table
that can contain PII and the obfuscation method that edX applies before the data
is included in an RDX package. For more information about this table, see
Columns in the student_courseenrollment Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
user_api_usercoursetag Table
The following table lists the columns in the user_api_usercoursetag table
that can contain PII and the obfuscation method that edX applies before the data
is included in an RDX package. For more information about this table, see
Columns in the user_api_usercoursetag Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
teams_courseteammembership Table
The following table lists the columns in the teams_courseteammembership
table that can contain PII and the obfuscation method that edX applies before
the data is included in an RDX package. For more information about this table,
see Columns in the teams_courseteammembership Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
courseware_studentmodule Table
The following table lists the columns in the courseware_studentmodule table
that can contain PII and the obfuscation method that edX applies before the
data is included in an RDX package. For more information about this table, see
Columns in the courseware_studentmodule Table.
| Column | Method |
|---|---|
| student_id | Remap (same as auth_user.id) |
| state | Replace |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
certificates_generatedcertificate Table
The following table lists the columns in the
certificates_generatedcertificate table that can contain PII and the
obfuscation method that edX applies before the data is included in an RDX
package. For more information about this table, see
Columns in the certificates_generatedcertificate Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
| download_url | Remove |
| verify_uuid | Remove |
| download_uuid | Remove |
| name | Remove |
| error_reason | Remove |
| key | Remove |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
verify_student_verificationstatus Table
The following table lists the columns in the
verify_student_verificationstatus table that can contain PII and the
obfuscation method that edX applies before the data is included in an RDX
package. For more information about this table, see
Columns in the verify_student_verificationstatus Table.
| Column | Method |
|---|---|
| user_id | Remap (same as auth_user.id) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
wiki_article Table
The following table lists the columns in the wiki_article table that can
contain PII and the obfuscation method that edX applies before the data is
included in an RDX package. For more information about this table, see
Fields in the wiki_article File.
| Column | Method |
|---|---|
| owner_id | Remove |
| group_id | Remove |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
wiki_articlerevision Table
The following table lists the columns in the wiki_articlerevision table
that can contain PII and the obfuscation method that edX applies before the data
is included in an RDX package. For more information about this table, see
Fields in the wiki_articlerevision File.
| Column | Method |
|---|---|
| automatic_log | Remove |
| content | Replace |
| ip_address | Remove |
| user_id | Remap (same as auth_user.id) |
| user_message | Remove |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
teams_courseteam SQL Table
The obfuscation process does not change any of the values in the
teams_courseteam table. For more information about this table, see
Columns in the teams_courseteam Table.
The following data is not included in RDX packages.
The {org}-email_opt_in-{site}-analytics.csv file. For more information about this report, see Institution-wide Data.
The user_id_map SQL table. For more information about this table, see Columns in the user_id_map Table.
The following table lists fields in the CommentThread and Comment JSON
documents that can contain PII and the obfuscation method that edX applies
before the data is included in an RDX package. For more information about these
documents, see Discussion Forums Data.
| Field | Method | Found in |
|---|---|---|
| author_id | Remap (same as auth_user.id) | CommentThread and Comment |
| title | Replace | CommentThread only |
| author_username | Remap (same as auth_user.username) | CommentThread and Comment |
| body | Replace | CommentThread and Comment |
| votes | Remap (same as auth_user.id) | CommentThread and Comment |
| endorsement | Remap (same as auth_user.id) | CommentThread only |
| abuse_flaggers | Remap (same as auth_user.id) | CommentThread only |
| historical_abuse_flaggers | Remap (same as auth_user.id) | CommentThread only |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
RDX data packages include both explicit events and implicit events.
The following topics list the implicit events that are included in RDX data packages.
Note
Implicit events are not instrumented by edX and are subject to change at any time.
The following table lists fields that can contain PII and that are common to all events, with the obfuscation method that edX applies before the data is included in an RDX package.
The common context field, which can contain event-specific member fields,
is described in the context member fields topic. For more information about the fields that are common to
all events, see Common Fields.
| Common Field | Method |
|---|---|
| host | Remove |
| ip | Remove |
| page | Remove |
| referer | Remove |
| username | Remap |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
context Member Fields
The following table lists member fields of the context field that are
obfuscated when present for any event.
| context Member Field | Method |
|---|---|
| client | Remove device or ip, if present |
| host | Remove |
| ip | Remove |
| path | Remove |
| user_id | Remap |
| username | Remap (same as auth_user.username) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
event Member Fields
The following table lists member fields of the event field that are
obfuscated when present for any event.
Note
To search for string values to replace, the obfuscation process recursively traverses the entire event data structure. This table lists event member fields that typically include data that is removed or remapped. Additional event member fields can also include data that is replaced.
| event Member Field | Method |
|---|---|
| answer.file_upload_key | Remove |
| certificate_id | Remove |
| certificate_url | Remove |
| fileName | Remove |
| GET | Remove |
| instructor | Remap (same as auth_user.username) |
| POST | Remove |
| report_url | Remove |
| requesting_student_id | Remove |
| saved_response.file_upload_key | Remove |
| student | Remap (same as auth_user.username) |
| source_url | Remove |
| url | Remove |
| url_name | Remove |
| user | Remap (same as auth_user.username) |
| user_id | Remap (same as auth_user.id) |
| username | Remap (same as auth_user.username) |
For more information about how edX changes the data in these fields, see Data Obfuscation Methods.
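The recursive traversal mentioned in the note on event member fields can be pictured with a short Python sketch. It assumes events are parsed JSON (nested dicts, lists, strings, and numbers) and that a separate string-level replace function exists. walk_strings and the stand-in lambda are hypothetical names for illustration only.

```python
from typing import Any, Callable


def walk_strings(node: Any, replace: Callable[[str], str]) -> Any:
    """Return a copy of a nested structure with every string value passed through replace."""
    if isinstance(node, dict):
        return {key: walk_strings(value, replace) for key, value in node.items()}
    if isinstance(node, list):
        return [walk_strings(item, replace) for item in node]
    if isinstance(node, str):
        return replace(node)
    return node  # numbers, booleans, and None are left untouched


event = {"answers": ["mail me: a@b.co", 3], "nested": {"note": "ok"}}
# Stand-in replace function; a real one would apply the replacement method to free text.
cleaned = walk_strings(event, lambda text: text.replace("a@b.co", "<<EMAIL>>"))
print(cleaned)
```

Only values are rewritten in this sketch. Dictionary keys are left as they are, which is an assumption about how member field names are treated.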